Abstract

For the universal hypothesis testing problem, where the goal is to decide between the known null 
hypothesis distribution and some other unknown distribution, Hoeffding proposed a universal test in the 
nineteen sixties. Hoeffding's universal test statistic can be written in terms of Kullback-Leibler (K-L) 
divergence between the empirical distribution of the observations and the null hypothesis distribution. In 
this paper a modification of Hoeffding's test is considered based on a relaxation of the K-L divergence, 
referred to as the mismatched divergence. The resulting mismatched test is shown to be a generalized 
likelihood-ratio test (GLRT) for the case where the alternate distribution lies in a parametric family of 
distributions characterized by a finite dimensional parameter, i.e., it is a solution to the corresponding 
composite hypothesis testing problem. For certain choices of the alternate distribution, it is shown that 
both the Hoeffding test and the mismatched test have the same asymptotic performance in terms of 
error exponents. A consequence of this result is that the GLRT is optimal in differentiating a particular 
distribution from others in an exponential family. It is also shown that the mismatched test has a significant 
advantage over the Hoeffding test in terms of finite sample size performance for applications involving 
large alphabet distributions. This advantage is due to the difference in the asymptotic variances of the 
two test statistics under the null hypothesis. 
I. Introduction and Background



This paper is concerned with the following hypothesis testing problem: Suppose that the observations
$\boldsymbol{Z} = \{Z_t : t = 1, \dots\}$ form an i.i.d. sequence evolving on a set of cardinality $N$, denoted by
$\mathsf{Z} = \{z_1, z_2, \dots, z_N\}$. Based on observations of this sequence we wish to decide if the marginal
distribution of the observations is a given distribution $\pi^0$, or some other distribution $\pi^1$ that is either
unknown or known only to belong to a certain class of distributions. When the observations have
distribution $\pi^0$ we say that the null hypothesis is true, and when the observations have some other
distribution $\pi^1$ we say that the alternate hypothesis is true.

A decision rule is characterized by a sequence of tests $\phi := \{\phi_n : n \ge 1\}$, where
$\phi_n : \mathsf{Z}^n \to \{0, 1\}$ with $\mathsf{Z}^n$ representing the $n$-th order Cartesian product of $\mathsf{Z}$. The decision
based on the first $n$ elements of the observation sequence is given by $\phi_n(Z_1, Z_2, \dots, Z_n)$, where
$\phi_n = 0$ represents a decision in favor of accepting $\pi^0$ as the true marginal distribution.

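The set of probability measures on $\mathsf{Z}$ is denoted $\mathcal{P}(\mathsf{Z})$. The relative entropy (or Kullback-Leibler
divergence) between two distributions $\nu^1, \nu^2 \in \mathcal{P}(\mathsf{Z})$ is denoted $D(\nu^1\|\nu^2)$, and for a given
$\mu \in \mathcal{P}(\mathsf{Z})$ and $\eta > 0$ the divergence ball of radius $\eta$ around $\mu$ is defined as,

$$Q_\eta(\mu) := \{\nu \in \mathcal{P}(\mathsf{Z}) : D(\nu\|\mu) < \eta\}. \tag{1}$$

The empirical distribution or type of the finite set of observations $(Z_1, Z_2, \dots, Z_n)$ is a random variable
$\Gamma^n$ taking values in $\mathcal{P}(\mathsf{Z})$:

$$\Gamma^n(z) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}\{Z_i = z\}, \qquad z \in \mathsf{Z}, \tag{2}$$

where $\mathbb{I}$ denotes the indicator function.

In the general universal hypothesis testing problem, the null distribution $\pi^0$ is known exactly, but
no prior information is available regarding the alternate distribution $\pi^1$. Hoeffding proposed in [2]
a generalized likelihood-ratio test (GLRT) for the universal hypothesis testing problem, in which the
alternate distribution $\pi^1$ is unrestricted: it is an arbitrary distribution in $\mathcal{P}(\mathsf{Z})$, the set of probability
distributions on $\mathsf{Z}$. Hoeffding's test sequence is given by,

$$\phi^{\mathrm{H}}_n = \mathbb{I}\Big\{ \sup_{\pi^1 \in \mathcal{P}(\mathsf{Z})} \frac{1}{n}\sum_{i=1}^{n} \log\frac{\pi^1(Z_i)}{\pi^0(Z_i)} \ge \eta \Big\}. \tag{3}$$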
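It is easy to see that the Hoeffding test (3) can be rewritten as follows:

$$\begin{aligned}
\phi^{\mathrm{H}}_n &= \mathbb{I}\Big\{ \frac{1}{n}\sum_{i=1}^{n} \log\frac{\Gamma^n(Z_i)}{\pi^0(Z_i)} \ge \eta \Big\} \\
&= \mathbb{I}\Big\{ \sum_{z \in \mathsf{Z}} \Gamma^n(z) \log\frac{\Gamma^n(z)}{\pi^0(z)} \ge \eta \Big\} \\
&= \mathbb{I}\{ D(\Gamma^n\|\pi^0) \ge \eta \} \\
&= \mathbb{I}\{ \Gamma^n \notin Q_\eta(\pi^0) \}
\end{aligned} \tag{4}$$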
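The last two forms of the test translate directly into code. The following is a minimal sketch, assuming a finite alphabet stored as dictionary keys and natural logarithms; the function names are illustrative and not taken from the paper.

```python
import math
from collections import Counter

def empirical_distribution(samples, alphabet):
    """Type (empirical distribution) of `samples` over `alphabet`, as in (2)."""
    counts = Counter(samples)
    n = len(samples)
    return {z: counts[z] / n for z in alphabet}

def kl_divergence(nu, mu):
    """D(nu || mu) in nats; inf if nu is not absolutely continuous w.r.t. mu."""
    total = 0.0
    for z, p in nu.items():
        if p == 0.0:
            continue  # 0 * log(0 / q) is taken to be 0
        if mu[z] == 0.0:
            return math.inf
        total += p * math.log(p / mu[z])
    return total

def hoeffding_test(samples, pi0, eta):
    """Return 1 (reject pi0) if D(Gamma^n || pi0) >= eta, else 0."""
    gamma_n = empirical_distribution(samples, pi0.keys())
    return 1 if kl_divergence(gamma_n, pi0) >= eta else 0

pi0 = {"a": 0.5, "b": 0.25, "c": 0.25}
print(hoeffding_test(["a", "a", "b", "c"], pi0, eta=0.05))  # type equals pi0
print(hoeffding_test(["c"] * 4, pi0, eta=0.05))             # type far from pi0
```

In the first call the type coincides with the null distribution, so the statistic is zero and the null is accepted; in the second the statistic equals log 4 and the null is rejected.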

If we have some prior information on the alternate distribution $\pi^1$, a different version of the GLRT
is used. In particular, suppose it is known that the alternate distribution lies in a parametric family of
distributions of the following form:

$$\mathcal{E}_{\pi^0} := \{\check{\pi}_r : r \in \mathbb{R}^d\},$$

where $\check{\pi}_r \in \mathcal{P}(\mathsf{Z})$ are probability distributions on $\mathsf{Z}$ parameterized by a parameter $r \in \mathbb{R}^d$. The specific
form of $\check{\pi}_r$ is defined later in the paper. In this case, the resulting composite hypothesis testing problem
is typically solved using a GLRT (see [?] for results related to the present paper, and [?] for a more
recent account) of the following form:

$$\phi^{\mathrm{MM}}_n = \mathbb{I}\Big\{ \sup_{\pi^1 \in \mathcal{E}_{\pi^0}} \sum_{z \in \mathsf{Z}} \Gamma^n(z) \log\frac{\pi^1(z)}{\pi^0(z)} \ge \eta \Big\}. \tag{5}$$

We show that this test can be interpreted as a relaxation of the Hoeffding test of (3). In particular we
show that (5) can be expressed in a form similar to (4),

$$\phi^{\mathrm{MM}}_n = \mathbb{I}\{ D^{\mathrm{MM}}(\Gamma^n\|\pi^0) \ge \eta \} \tag{6}$$

where $D^{\mathrm{MM}}$ is the mismatched divergence, a relaxation of the K-L divergence in the sense that
$D^{\mathrm{MM}}(\mu\|\pi) \le D(\mu\|\pi)$ for any $\mu, \pi \in \mathcal{P}(\mathsf{Z})$. We refer to the test (6) as the mismatched test.

This paper is devoted to the analysis of the mismatched divergence and mismatched test. 

The terminology is borrowed from the mismatched channel (see Lapidoth [?] for a bibliography). The
mismatched divergence described here is a generalization of the relaxation introduced in [6]. In this way
we embed the analysis of the resulting universal test within the framework of Csiszár and Shields [?].
The mismatched test statistic can also be viewed as a generalization of the robust hypothesis testing
statistic introduced in [?], [?].

When the alternate distribution satisfies $\pi^1 \in \mathcal{E}_{\pi^0}$, we show that, under some regularity conditions on
$\mathcal{E}_{\pi^0}$, the mismatched test of (6) and Hoeffding's test of (3) have identical asymptotic performance in
terms of error exponents. A consequence of this result is that the GLRT is optimal in differentiating a
particular distribution from others in an exponential family of distributions. We also establish that the
proposed mismatched test has a significant advantage over the Hoeffding test in terms of finite sample size
performance. This advantage is due to the difference in the asymptotic variances of the two test statistics
under the null hypothesis. In particular, we show that the variance of the K-L divergence grows linearly
with the alphabet size, making the test impractical for applications involving large alphabet distributions.
We also show that the variance of the mismatched divergence grows linearly with the dimension $d$ of
the parameter space, and can hence be controlled through a prudent choice of the function class defining
the mismatched divergence.

The remainder of the paper is organized as follows. We begin in Section II with a description of
mismatched divergence and the mismatched test, and describe their relation to other concepts including
robust hypothesis testing, composite hypothesis testing, reverse I-projection, and maximum likelihood
(ML) estimation. Formulae for the asymptotic mean and variance of the test statistics are presented in
Section III. Section III also contains a discussion interpreting these asymptotic results in terms of the
performance of the detection rule. Proofs of the main results are provided in the appendix. Conclusions
and directions for future research are contained in Section IV.



We adopt the following compact notation in the paper: For any function $f : \mathsf{Z} \to \mathbb{R}$ and $\pi \in \mathcal{P}(\mathsf{Z})$ we
denote the mean $\sum_{z \in \mathsf{Z}} f(z)\pi(z)$ by $\pi(f)$, or by $\langle \pi, f \rangle$ when we wish to emphasize the convex-analytic
setting. At times we will extend these definitions to allow functions $f$ taking values in a vector space.
For $z \in \mathsf{Z}$ and $\pi \in \mathcal{P}(\mathsf{Z})$, we still use $\pi(z)$ to denote the probability assigned to element $z$ under measure
$\pi$. The meaning of such notation will be clear from context.

The logarithmic moment generating function (log-MGF) is denoted

$$\Lambda_\pi(f) = \log(\pi(\exp(f)))$$

where $\pi(\exp(f)) = \sum_{z \in \mathsf{Z}} \pi(z)\exp(f(z))$ by the notation we introduced in the previous paragraph. For
any two probability measures $\nu^1, \nu^2 \in \mathcal{P}(\mathsf{Z})$ the relative entropy is expressed,

$$D(\nu^1\|\nu^2) = \begin{cases} \langle \nu^1, \log(\nu^1/\nu^2) \rangle & \text{if } \nu^1 \prec \nu^2 \\ \infty & \text{else} \end{cases}$$

where $\nu^1 \prec \nu^2$ denotes absolute continuity.

II. Mismatched Divergence

The following proposition recalls a well-known variational
representation. This can be obtained, for instance, by specializing the representation in [10] to an i.i.d.
setting. An alternate variational representation of the divergence is introduced in [11].
Proposition II.1. The relative entropy can be expressed as the convex dual of the log moment generating
function: For any two probability measures $\nu^1, \nu^2 \in \mathcal{P}(\mathsf{Z})$,

$$D(\nu^1\|\nu^2) = \sup_{f} \big( \nu^1(f) - \Lambda_{\nu^2}(f) \big) \tag{7}$$

where the supremum is taken over the space of all real-valued functions on $\mathsf{Z}$. Furthermore, if $\nu^1$ and $\nu^2$
have equal supports, then the supremum is achieved by the log likelihood ratio function $f^* = \log(\nu^1/\nu^2)$.

Outline of proof: Although the result is well known, we provide a simple proof here since similar 
arguments will be reused later in the paper. 

For any function $f$ and probability measure $\nu$ we have,

$$\begin{aligned}
D(\nu^1\|\nu^2) &= \langle \nu^1, \log(\nu^1/\nu^2) \rangle \\
&= \langle \nu^1, \log(\nu/\nu^2) \rangle + \langle \nu^1, \log(\nu^1/\nu) \rangle.
\end{aligned}$$

On setting $\nu = \nu^2 \exp(f - \Lambda_{\nu^2}(f))$ this gives,

$$D(\nu^1\|\nu^2) = \nu^1(f) - \Lambda_{\nu^2}(f) + D(\nu^1\|\nu) \ge \nu^1(f) - \Lambda_{\nu^2}(f).$$

If $\nu^1$ and $\nu^2$ have equal supports, then the above inequality holds with equality for $f = \log(\nu^1/\nu^2)$,
which would lead to $\nu = \nu^1$. This proves that (7) holds whenever $\nu^1$ and $\nu^2$ have equal supports. The
proof for general distributions is similar and is omitted here. ■
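The variational representation is easy to check numerically on a small alphabet. The sketch below, with arbitrary illustrative distributions, evaluates the objective $\nu^1(f) - \Lambda_{\nu^2}(f)$ at the log likelihood ratio and at randomly drawn functions; distributions are stored as lists indexed by the alphabet.

```python
import math
import random

def log_mgf(nu, f):
    """Lambda_nu(f) = log sum_z nu(z) exp(f(z))."""
    return math.log(sum(p * math.exp(fz) for p, fz in zip(nu, f)))

def objective(nu1, nu2, f):
    """nu1(f) - Lambda_{nu2}(f), the quantity maximized in (7)."""
    return sum(p * fz for p, fz in zip(nu1, f)) - log_mgf(nu2, f)

def kl(nu1, nu2):
    """Relative entropy D(nu1 || nu2) in nats (equal supports assumed)."""
    return sum(p * math.log(p / q) for p, q in zip(nu1, nu2) if p > 0)

nu1 = [0.2, 0.5, 0.3]
nu2 = [0.4, 0.4, 0.2]
f_star = [math.log(p / q) for p, q in zip(nu1, nu2)]

rng = random.Random(0)
random_values = [objective(nu1, nu2, [rng.uniform(-3, 3) for _ in nu1])
                 for _ in range(1000)]

# The first two numbers agree; the third never exceeds them.
print(kl(nu1, nu2), objective(nu1, nu2, f_star), max(random_values))
```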
The representation (7) is the basis of the mismatched divergence. We fix a set of functions denoted
by $\mathcal{F}$, and obtain a lower bound on the relative entropy by taking the supremum over the smaller set as
follows,

$$D^{\mathrm{MM}}(\nu^1\|\nu^2) := \sup_{f \in \mathcal{F}} \{ \nu^1(f) - \Lambda_{\nu^2}(f) \}. \tag{8}$$

If $\nu^1$ and $\nu^2$ have full support, and if the function class $\mathcal{F}$ contains the log-likelihood ratio function
$f^* = \log(\nu^1/\nu^2)$, then it is immediate from Proposition II.1 that the supremum in (8) is achieved by
$f^*$, and in this case $D^{\mathrm{MM}}(\nu^1\|\nu^2) = D(\nu^1\|\nu^2)$. Moreover, since the objective function in (8) is invariant
to shifts of $f$, it follows that even if a constant scalar is added to the function $f^*$, it still achieves the
supremum in (8).
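The shift invariance follows in one line from the definition of the log-MGF. For any constant $c \in \mathbb{R}$ (a symbol introduced here only for this check):

```latex
\nu^1(f + c) - \Lambda_{\nu^2}(f + c)
  = \nu^1(f) + c - \log\bigl(e^{c}\,\nu^2(\exp(f))\bigr)
  = \nu^1(f) + c - c - \Lambda_{\nu^2}(f)
  = \nu^1(f) - \Lambda_{\nu^2}(f).
```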
In this paper the function class is assumed to be defined through a finite-dimensional parametrization
of the form,

$$\mathcal{F} = \{ f_r : r \in \mathbb{R}^d \}. \tag{9}$$

Further assumptions will be imposed in our main results. In particular, we will assume that $f_r(z)$ is
differentiable as a function of $r$ for each $z$.

We fix a distribution $\pi \in \mathcal{P}(\mathsf{Z})$ and a function class of the form (9). For each $r \in \mathbb{R}^d$ the twisted
distribution $\check{\pi}_r \in \mathcal{P}(\mathsf{Z})$ is defined as,

$$\check{\pi}_r := \pi \exp(f_r - \Lambda_\pi(f_r)). \tag{10}$$

The collection of all such distributions parameterized by $r$ is denoted

$$\mathcal{E}_\pi := \{ \check{\pi}_r : r \in \mathbb{R}^d \}. \tag{11}$$
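As an illustration of (10), the following sketch computes twisted distributions for a one-parameter class $f_r = r\psi$. The base distribution and the function `psi` are hypothetical values chosen for the example, not taken from the paper.

```python
import math

def twisted_distribution(pi, f):
    """Return pi * exp(f - Lambda_pi(f)), the twisted distribution of (10)."""
    weights = [p * math.exp(fz) for p, fz in zip(pi, f)]
    normalizer = sum(weights)  # equals exp(Lambda_pi(f))
    return [w / normalizer for w in weights]

pi = [0.5, 0.3, 0.2]
psi = [0.0, 1.0, 2.0]  # hypothetical basis function on a 3-letter alphabet

for r in (-1.0, 0.0, 1.0):
    f_r = [r * v for v in psi]  # f_r = r * psi
    print(r, twisted_distribution(pi, f_r))
```

Setting $r = 0$ returns $\pi$ itself, and adding a constant to $f_r$ leaves the twisted distribution unchanged, since the constant cancels in the normalization.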

A. Applications 

The applications of mismatched divergence include those applications surveyed in Section 3 of [?] in
their treatment of generalized likelihood ratio tests. Here we list potential applications in three domains:
hypothesis testing, source coding, and nonlinear filtering. Other applications include channel coding and
signal detection, following [?].

1) Hypothesis testing: The problem of universal hypothesis testing is relevant in several practical
applications including anomaly detection. It is often possible to have an accurate model of the normal
behavior of a system, which is usually represented by the null hypothesis distribution $\pi^0$. The anomalous
behavior is often unknown, which is represented by the unknown alternate distribution. The primary
motivation for our research is to improve the finite sample size performance of Hoeffding's universal
hypothesis test (3). The difficulty we address is the large variance of this test statistic when the alphabet
size is large. Theorem II.2 makes this precise:

Theorem II.2. Let $\pi^0, \pi^1 \in \mathcal{P}(\mathsf{Z})$ have full supports over $\mathsf{Z}$.
(i) Suppose that the observation sequence $\boldsymbol{Z}$ is i.i.d. with marginal $\pi^0$. Then the normalized Hoeffding
test statistic sequence $\{ nD(\Gamma^n\|\pi^0) : n \ge 1 \}$ has the following asymptotic bias and variance:

$$\lim_{n \to \infty} \mathsf{E}[nD(\Gamma^n\|\pi^0)] = \tfrac{1}{2}(N - 1) \tag{12}$$

$$\lim_{n \to \infty} \mathsf{Var}[nD(\Gamma^n\|\pi^0)] = \tfrac{1}{2}(N - 1) \tag{13}$$

where $N = |\mathsf{Z}|$ denotes the size (cardinality) of $\mathsf{Z}$. Furthermore, the following weak convergence
result holds:

$$nD(\Gamma^n\|\pi^0) \xrightarrow{d.} \tfrac{1}{2}\chi^2_{N-1} \tag{14}$$

where $\chi^2_{N-1}$ denotes the chi-squared distribution with $N - 1$ degrees of freedom.
(ii) Suppose the sequence $\boldsymbol{Z}$ is drawn i.i.d. under $\pi^1 \ne \pi^0$. We then have,

$$\lim_{n \to \infty} \mathsf{E}\big[ n\big( D(\Gamma^n\|\pi^0) - D(\pi^1\|\pi^0) \big) \big] = \tfrac{1}{2}(N - 1).$$

□

The bias result of (12) follows from the unpublished report [12] (see [13, Sec. III.C]), and the weak
convergence result of (14) is given in [14]. All the results of the theorem, including (13), also follow
from Theorem III.2. We elaborate on this in Section III.

We see from Theorem II.2 that the bias of the divergence statistic $D(\Gamma^n\|\pi^0)$ decays as $\frac{N-1}{2n}$, irrespective
of whether the observations are drawn from distribution $\pi^0$ or $\pi^1$. One could argue that the problem of
high bias in the Hoeffding test statistic can be addressed by setting a higher threshold. However, we also
notice that when the observations are drawn under $\pi^0$, the variance of the divergence statistic decays as
$\frac{N-1}{2n^2}$, which can be significant when $N$ is of the order of $n^2$. This is a more serious flaw of the Hoeffding
test for large alphabet sizes, since it cannot be addressed as easily.
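The limits (12) and (13) can be checked by simulation. The sketch below is not the authors' experiment: it assumes a uniform null distribution on a hypothetical alphabet of size 10, draws repeated samples of length 500, and compares the sample mean and variance of $nD(\Gamma^n\|\pi^0)$ with $\frac{1}{2}(N-1)$.

```python
import math
import random
from collections import Counter

def scaled_divergence(samples, pi0):
    """n * D(Gamma^n || pi0) for a list of integer-valued samples."""
    n = len(samples)
    total = 0.0
    for z, count in Counter(samples).items():
        g = count / n
        total += g * math.log(g / pi0[z])
    return n * total

def simulate(N=10, n=500, trials=2000, seed=0):
    """Sample mean and variance of the scaled statistic under a uniform null."""
    rng = random.Random(seed)
    pi0 = [1.0 / N] * N
    stats = [scaled_divergence(rng.choices(range(N), k=n), pi0)
             for _ in range(trials)]
    mean = sum(stats) / trials
    var = sum((s - mean) ** 2 for s in stats) / (trials - 1)
    return mean, var

mean, var = simulate()
print(mean, var, (10 - 1) / 2)  # both estimates should be close to 4.5
```

Because the estimates come from a finite number of trials at finite $n$, they only approximate the limiting value.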

The weak convergence result in (14), and other such results established later in this paper, can be used
to guide the choice of thresholds for a finite sample test, subject to a constraint on the probability of false
alarm (see for example, [7, p. 457]). As an application of the weak convergence result (14) we propose the
following approximation for the false alarm probability in the Hoeffding test defined in (3),

$$p_{\mathrm{FA}} := \mathsf{P}_{\pi^0}\{ \phi^{\mathrm{H}}_n = 1 \} \approx \mathsf{P}\Big\{ \frac{1}{2}\sum_{i=1}^{N-1} W_i^2 \ge n\eta \Big\} \tag{15}$$

where $\{W_i\}$ are i.i.d. $N(0, 1)$ random variables. In this way we can obtain a simple formula for the
threshold to approximately achieve a given constraint on $p_{\mathrm{FA}}$. For moderate values of the sequence length
$n$, the $\chi^2$ approximation gives a more accurate prediction of the false alarm probabilities for the Hoeffding
test compared to those predicted using Sanov's theorem as we demonstrate below.
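The right-hand side of (15) is the tail of a Gamma distribution with shape $(N-1)/2$ and unit scale, so it equals the regularized upper incomplete gamma function $Q((N-1)/2, n\eta)$. The following sketch evaluates it with a standard power series using only the Python standard library; the helper names and the parameter values are illustrative assumptions, and the series loses relative accuracy when the tail probability is extremely small.

```python
import math

def upper_gamma_q(a, x):
    """Regularized upper incomplete gamma Q(a, x), via the series for P(a, x)."""
    if x <= 0.0:
        return 1.0
    term = 1.0
    series = 1.0
    k = 1
    while term > 1e-17 * series:
        term *= x / (a + k)
        series += term
        k += 1
    log_prefactor = -x + a * math.log(x) - math.lgamma(a + 1.0)
    return max(0.0, 1.0 - math.exp(log_prefactor) * series)

def false_alarm_chi2(N, n, eta):
    """Approximation (15): P{ (1/2) * chi2_{N-1} >= n * eta }."""
    return upper_gamma_q((N - 1) / 2.0, n * eta)

def false_alarm_sanov(n, eta):
    """First-order large-deviations estimate exp(-n * eta)."""
    return math.exp(-n * eta)

N, n = 20, 100
for eta in (0.10, 0.15, 0.20):
    print(eta, false_alarm_chi2(N, n, eta), false_alarm_sanov(n, eta))
```

For these values the two estimates differ by orders of magnitude, which is the gap between the two approximations discussed in the text.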

Consider the application of (15) in the following example. We used Monte-Carlo simulations to
approximate the performance of the Hoeffding test described in (3), with $\pi^0$ the uniform distribution on
an alphabet of size 20. Shown in Figure 1 is a semi-log plot comparing three quantities: the probability
of false alarm $p_{\mathrm{FA}}$, estimated via simulation; the approximation (15) obtained from the Central Limit
Theorem; and the approximation obtained from Sanov's Theorem, $\log(p_{\mathrm{FA}}) \approx -n\eta$. It is clearly seen
that the approximation based on the weak convergence result of (15) is far more accurate than the
approximation based on Sanov's theorem. It should be noted that the approximate formula for the false 
alarm probability obtained from Sanov's theorem can be made more accurate by using refinements of large 
deviation results given in [15]. However, these refinements are often difficult to compute. For instance, 
it can be shown using the results of [15] that $p_{\mathrm{FA}} \approx c\, n^{\frac{N-3}{2}} \exp(-n\eta)$ where constant $c$ is given by a
surface integral over the surface of the divergence ball, $Q_\eta(\pi^0)$.
Fig. 1. Approximations for the error probability in universal hypothesis testing. The error probability of the Hoeffding test is
closely approximated by the approximation (15).



One approach to addressing the implementation issues of the universal test is through clustering (or
partitioning) the alphabet as in [16], or smoothing in the space of probability measures as in [17], [18]
to extend the Hoeffding test to the case of continuous alphabets. The mismatched test proposed here is
a generalization of a partition in the following sense. Suppose that $\{A_i : 1 \le i \le N_a\}$ are disjoint sets
satisfying $\cup A_i = \mathsf{Z}$, and let $Y_t = i$ if $Z_t \in A_i$. Applying (13), we conclude that the Hoeffding test
using $\boldsymbol{Y}$ instead of $\boldsymbol{Z}$ will have asymptotic variance equal to $\frac{1}{2}(N_a - 1)$, where $N_a < N$ for a non-trivial
partition. We have:

Proposition II.3. Suppose that the mismatched divergence is defined with respect to the linear function
class (26) using $\psi_i = \mathbb{I}_{A_i}$, $1 \le i \le N_a$. In this case the mismatched test (5) coincides with the Hoeffding
test using observations $\boldsymbol{Y}$. □
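Proposition II.3 can be illustrated numerically. With indicator basis functions, the objective in (8) evaluated at the coefficients $r_i = \log(\mu(A_i)/\pi(A_i))$ equals the divergence between the quantized distributions. The sketch below uses a hypothetical six-letter alphabet split into three cells; it only checks this identity and does not prove optimality, which follows from concavity of the objective in $r$.

```python
import math

def kl(nu, mu):
    """Relative entropy D(nu || mu) in nats."""
    return sum(p * math.log(p / q) for p, q in zip(nu, mu) if p > 0)

def objective(mu, pi, f):
    """mu(f) - Lambda_pi(f), the quantity maximized in (8)."""
    mean_f = sum(p * fz for p, fz in zip(mu, f))
    return mean_f - math.log(sum(q * math.exp(fz) for q, fz in zip(pi, f)))

# cell[z] is the index i of the set A_i that contains letter z.
cell = [0, 0, 1, 1, 2, 2]
pi = [0.10, 0.20, 0.15, 0.25, 0.05, 0.25]
mu = [0.30, 0.10, 0.10, 0.10, 0.20, 0.20]

def quantize(nu):
    """Distribution of the cell index Y when the letter is drawn from nu."""
    out = [0.0] * (max(cell) + 1)
    for z, p in enumerate(nu):
        out[cell[z]] += p
    return out

mu_y, pi_y = quantize(mu), quantize(pi)
r_star = [math.log(a / b) for a, b in zip(mu_y, pi_y)]  # candidate maximizer
f_star = [r_star[cell[z]] for z in range(len(pi))]      # f = sum_i r_i 1_{A_i}

# First two values agree; the third is the unquantized divergence, which is larger.
print(objective(mu, pi, f_star), kl(mu_y, pi_y), kl(mu, pi))
```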
The advantage of the mismatched test (5) over a partition is that we can incorporate prior knowledge
regarding alternate statistics, and we can include non-standard 'priors' such as continuity of the
log-likelihood ratio function between the null and alternate distributions. This is useful in anomaly detection
applications where one may have models of anomalous behavior which can be used to design the correct
mismatched test for the desired application.

2) Source coding with training: Let $\pi$ denote a source distribution on a finite alphabet $\mathsf{Z}$. Suppose we
do not know $\pi$ exactly and we design optimal codelengths assuming that the distribution is $\mu$: For letter
$z \in \mathsf{Z}$ we let $\ell(z) = -\log(\mu(z))$ denote Shannon's codeword length. The expected codelength is thus,

$$\mathsf{E}[\ell] = \sum_{z \in \mathsf{Z}} \ell(z)\pi(z) = H(\pi) + D(\pi\|\mu)$$

where $H$ denotes the entropy, $H(\pi) = -\sum_{z \in \mathsf{Z}} \pi(z)\log(\pi(z))$. Let $\ell^* := H(\pi)$ denote the optimal (minimal)
expected codelength.
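The decomposition of the expected codelength into entropy plus divergence can be verified on a small example. The distributions below are hypothetical, and lengths are measured in nats for consistency with the natural logarithm used elsewhere.

```python
import math

def entropy(pi):
    """Shannon entropy H(pi) in nats."""
    return -sum(p * math.log(p) for p in pi if p > 0)

def kl(pi, mu):
    """Relative entropy D(pi || mu) in nats."""
    return sum(p * math.log(p / q) for p, q in zip(pi, mu) if p > 0)

def expected_codelength(pi, mu):
    """Expected Shannon codelength when codes are designed for mu but letters follow pi."""
    return sum(p * -math.log(q) for p, q in zip(pi, mu))

pi = [0.5, 0.25, 0.25]  # true source distribution
mu = [0.4, 0.4, 0.2]    # design distribution

excess = expected_codelength(pi, mu) - entropy(pi)
print(excess, kl(pi, mu))  # the excess codelength equals the divergence
```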

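Now suppose it is known that under $\pi$ the probability of each letter $z \in \mathsf{Z}$ is bounded away from zero.
That is, we assume that for some $\epsilon > 0$,

$$\pi \in \mathcal{P}_\epsilon := \{ \mu \in \mathcal{P}(\mathsf{Z}) : \mu(z) > \epsilon, \text{ for all } z \in \mathsf{Z} \}.$$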

Further suppose that a training sequence of length $n$ is given, drawn under $\pi$. We are interested in
constructing a source code for encoding symbols from the source $\pi$ based on these training symbols. Let
$\Gamma^n$ denote the empirical distribution (i.e., the type) of the observations based on these $n$ training symbols.
We assign codeword lengths to each symbol $z$ according to the following rule,

$$\ell_n(z) = \begin{cases} \log\dfrac{1}{\Gamma^n(z)} & \text{if } \Gamma^n \in \mathcal{P}_{\epsilon/2} \\[1ex] \log\dfrac{1}{\pi^u(z)} & \text{else} \end{cases}$$

where $\pi^u$ is the uniform distribution on $\mathsf{Z}$.

Let $\mathcal{T}$ denote the sigma-algebra generated by the training symbols. The conditional expected codelength
given $\mathcal{T}$ satisfies,

$$\mathsf{E}[\ell_n \mid \mathcal{T}] = \begin{cases} \ell^* + D(\pi\|\Gamma^n) & \text{if } \Gamma^n \in \mathcal{P}_{\epsilon/2} \\ \ell^* + D(\pi\|\pi^u) & \text{else} \end{cases}$$

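We study the behavior of $\mathsf{E}[\ell_n - \ell^* \mid \mathcal{T}]$ as a function of $n$. We argue in the appendix that a modification
of the results from Theorem III.2 can be used to establish the following relations as $n \to \infty$:

$$\mathsf{E}[n(\ell_n - \ell^*)] \to \tfrac{1}{2}(N - 1), \qquad \mathsf{Var}\big[n\,\mathsf{E}[\ell_n \mid \mathcal{T}]\big] \to \tfrac{1}{2}(N - 1) \tag{16}$$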
where $N$ is the cardinality of the alphabet $\mathsf{Z}$. Comparing with Theorem II.2 we conclude that the
asymptotic behavior of the excess codelength is identical to the asymptotic behavior of the Hoeffding
test statistic $D(\Gamma^n\|\pi)$ under $\pi$. Methods such as those proposed in this paper can be used to reduce high
variance, just as in the hypothesis testing problem emphasized in this paper.

3) Filtering: The recent paper [19] considers approximations for the nonlinear filtering problem.
Suppose that $\boldsymbol{X}$ is a Markov chain on $\mathbb{R}^n$, and $\boldsymbol{Y}$ is an associated observation process on $\mathbb{R}^p$ of the
form $Y(t) = \gamma(X(t), W(t))$, where $\boldsymbol{W}$ is an i.i.d. sequence. The conditional distribution of $X(t)$ given
$\{Y(0), \dots, Y(t)\}$ is denoted $B_t$, known as the belief state in this literature. The evolution of the belief
state can be expressed in a recursive form: For some mapping $\phi : \mathcal{B}(\mathbb{R}^n) \times \mathbb{R}^p \to \mathcal{B}(\mathbb{R}^n)$,

$$B_{t+1} = \phi(B_t, Y_{t+1}), \qquad t \ge 0.$$

The approximation proposed in [19] is based on a projection of $B_t$ onto an exponential family of densities
over $\mathbb{R}^n$, of the form $p_\theta(x) = p_0(x)\exp(\theta^T\psi(x) - \Lambda(\theta))$, $\theta \in \mathbb{R}^d$. They consider the reverse I-projection,

$$B^\theta = \arg\min_{\mu \in \mathcal{E}} D(B\|\mu)$$

where the minimum is over $\mathcal{E} = \{p_\theta\}$. From the definition of divergence this is equivalently expressed,

$$B^\theta = \arg\max_{\theta} \int \big( \theta^T\psi(x) - \Lambda(\theta) \big) B(dx). \tag{17}$$

A projected filter is defined by the recursion,

$$B_{t+1} = [\phi(B_t, Y_{t+1})]^\theta, \qquad t \ge 0. \tag{18}$$

The techniques in the current paper provide algorithms for computation of this projection, and suggest
alternative projection schemes, such as the robust approach described in Section II-F.

B. Basic structure of mismatched divergence 

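The mismatched test is defined to be a relaxation of the Hoeffding test described in (3). We replace
the divergence functional with the mismatched divergence $D^{\mathrm{MM}}(\Gamma^n\|\pi^0)$ to obtain the mismatched test
sequence,

$$\phi^{\mathrm{MM}}_n = \mathbb{I}\{ D^{\mathrm{MM}}(\Gamma^n\|\pi^0) \ge \eta \} = \mathbb{I}\{ \Gamma^n \notin Q^{\mathrm{MM}}_\eta(\pi^0) \} \tag{19}$$

where $Q^{\mathrm{MM}}_\eta(\pi^0)$ is the mismatched divergence ball of radius $\eta$ around $\pi^0$ defined analogously to (1):

$$Q^{\mathrm{MM}}_\eta(\pi) = \{ \nu \in \mathcal{P}(\mathsf{Z}) : D^{\mathrm{MM}}(\nu\|\pi) < \eta \}. \tag{20}$$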
The next proposition establishes some basic geometry of the mismatched divergence balls. For any
function $g$ we define the following hyperplane and half-space:

$$\mathcal{H}_g := \{ \nu : \nu(g) = 0 \}, \qquad \mathcal{H}^-_g := \{ \nu : \nu(g) < 0 \}. \tag{21}$$

Proposition II.4. The following hold for any $\nu, \pi \in \mathcal{P}(\mathsf{Z})$, and any collection of functions $\mathcal{F}$:

(i) For each $\eta > 0$ we have $Q^{\mathrm{MM}}_\eta(\pi) \subset \bigcap \mathcal{H}^-_g$, where the intersection is over all functions $g$ of the
form,

$$g = f - \Lambda_\pi(f) - \eta \tag{22}$$

with $f \in \mathcal{F}$.

(ii) Suppose that $\eta = D^{\mathrm{MM}}(\nu\|\pi)$ is finite and non-zero. Further suppose that for $\nu^1 = \nu$ and $\nu^2 = \pi$,
the supremum in (8) is achieved by $f^* \in \mathcal{F}$. Then $\mathcal{H}_{g^*}$ is a supporting hyperplane to $Q^{\mathrm{MM}}_\eta(\pi)$, where
$g^*$ is given in (22) with $f = f^*$.

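Proof: (i) Suppose $\mu \in Q^{\mathrm{MM}}_\eta(\pi)$. Then, for any $f \in \mathcal{F}$,

$$\mu(f) - \Lambda_\pi(f) - \eta \le D^{\mathrm{MM}}(\mu\|\pi) - \eta < 0.$$

That is, for any $f \in \mathcal{F}$, on defining $g$ by (22) we obtain the desired inclusion $Q^{\mathrm{MM}}_\eta(\pi) \subset \mathcal{H}^-_g$.
(ii) Let $\mu \in \mathcal{H}_{g^*}$ be arbitrary. Then we have:

$$\begin{aligned}
D^{\mathrm{MM}}(\mu\|\pi) &= \sup_{r} \big( \mu(f_r) - \Lambda_\pi(f_r) \big) \\
&\ge \mu(f^*) - \Lambda_\pi(f^*) \\
&= \Lambda_\pi(f^*) + \eta - \Lambda_\pi(f^*) = \eta.
\end{aligned}$$

Hence it follows that $\mathcal{H}_{g^*}$ supports $Q^{\mathrm{MM}}_\eta(\pi)$ at $\nu$.

■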

C. Asymptotic optimality of the mismatched test 

The asymptotic performance of a binary hypothesis testing problem is typically characterized in terms
of error exponents. We adopt the following criterion for performance evaluation, following Hoeffding [2]
(and others, notably [17], [18]). Suppose that the observations $\boldsymbol{Z} = \{Z_t : t = 1, \dots\}$ form an i.i.d.
sequence evolving on $\mathsf{Z}$. For a given $\pi^0$, and a given alternate distribution $\pi^1$, the type I and type II error
exponents are denoted respectively by,

$$\begin{aligned}
J^0_\phi &:= \liminf_{n \to \infty} -\frac{1}{n}\log\big( \mathsf{P}_{\pi^0}\{ \phi_n = 1 \} \big), \\
J^1_\phi &:= \liminf_{n \to \infty} -\frac{1}{n}\log\big( \mathsf{P}_{\pi^1}\{ \phi_n = 0 \} \big)
\end{aligned} \tag{23}$$

where in the first limit the marginal distribution of $Z_t$ is $\pi^0$, and in the second it is $\pi^1$. The limit $J^0_\phi$ is
also called the false-alarm error exponent, and $J^1_\phi$ the missed-detection error exponent.

Fig. 2. Geometric interpretation of the log likelihood ratio test. The exponent $\beta^* = \beta^*(\eta)$ is the largest constant satisfying
$Q_\eta(\pi^0) \cap Q_{\beta^*}(\pi^1) = \emptyset$. The hyperplane $\mathcal{H}_{\mathrm{LLR}} := \{ \nu : \nu(L) = \check{\pi}(L) \}$ separates the convex sets $Q_\eta(\pi^0)$ and $Q_{\beta^*}(\pi^1)$.

For a given constraint $\eta > 0$ on the false-alarm exponent $J^0_\phi$, an optimal test is the solution to the
asymptotic Neyman-Pearson hypothesis testing problem,

$$\beta^*(\eta) = \sup\{ J^1_\phi : \text{subject to } J^0_\phi \ge \eta \} \tag{24}$$

where the supremum is over all allowed test sequences $\phi$. While the exponent $\beta^*(\eta) = \beta^*(\eta, \pi^1)$ depends
upon $\pi^1$, Hoeffding's test we described in (3) does not require knowledge of $\pi^1$, yet achieves the optimal
exponent $\beta^*(\eta, \pi^1)$ for any $\pi^1$. The optimality of Hoeffding's test established in [2] easily follows from
Sanov's theorem.

While the mismatched test described in (19) is not always optimal for (24) for a general choice of $\pi^1$, it is
optimal for some specific choices of the alternate distributions. The following corollary to Proposition II.4
captures this idea.

Corollary II.1. Suppose $\pi^0, \pi^1 \in \mathcal{P}(\mathsf{Z})$ have equal supports. Further suppose that for all $\alpha > 0$, there
exists $\tau \in \mathbb{R}$ and $r \in \mathbb{R}^d$ such that

$$\alpha L(z) + \tau = f_r(z) \quad \text{a.e. } [\pi^0],$$

where $L$ is the log likelihood-ratio function $L := \log(\pi^1/\pi^0)$. Then the mismatched test is optimal in the
sense that the constraint $J^0_{\phi^{\mathrm{MM}}} \ge \eta$ is satisfied with equality, and under $\pi^1$ the optimal error exponent is
achieved; i.e. $J^1_{\phi^{\mathrm{MM}}} = \beta^*(\eta)$ for all $\eta \in (0, D(\pi^1\|\pi^0))$.

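Proof: Suppose that the conditions stated in the corollary hold. Consider the twisted distribution
$\check{\pi} = \kappa(\pi^0)^{1-\varrho}(\pi^1)^{\varrho}$, where $\kappa$ is a normalizing constant and $\varrho \in (0, 1)$ is chosen so as to guarantee
$D(\check{\pi}\|\pi^0) = \eta$. It is known that the hyperplane $\mathcal{H}_{\mathrm{LLR}} := \{ \nu : \nu(L) = \check{\pi}(L) \}$ separates the divergence balls
$Q_\eta(\pi^0)$ and $Q_{\beta^*}(\pi^1)$ at $\check{\pi}$. This geometry, which is implicit in [17], is illustrated in Figure 2.
From the form of $\check{\pi}$ it is also clear that

$$\log\frac{\check{\pi}}{\pi^0} = \varrho L - \Lambda_{\pi^0}(\varrho L).$$

Hence it follows that the supremum in the variational representation of $D(\check{\pi}\|\pi^0)$ is achieved by $\varrho L$.
Furthermore, since $\varrho L + \tau \in \mathcal{F}$ for some $\tau \in \mathbb{R}$ we have

$$\begin{aligned}
D^{\mathrm{MM}}(\check{\pi}\|\pi^0) = D(\check{\pi}\|\pi^0) = \eta
&= \check{\pi}(\varrho L + \tau) - \Lambda_{\pi^0}(\varrho L + \tau) \\
&= \check{\pi}(\varrho L) - \Lambda_{\pi^0}(\varrho L).
\end{aligned}$$

This means that $\mathcal{H}_{\mathrm{LLR}} = \{ \nu : \nu(\varrho L - \Lambda_{\pi^0}(\varrho L) - \eta) = 0 \}$. Hence, by applying Proposition II.4 (ii) it
follows that the hyperplane $\mathcal{H}_{\mathrm{LLR}}$ separates $Q^{\mathrm{MM}}_\eta(\pi^0)$ and $Q_{\beta^*}(\pi^1)$. This in particular means that the sets
$Q^{\mathrm{MM}}_\eta(\pi^0)$ and $Q_{\beta^*}(\pi^1)$ are disjoint. This fact, together with Sanov's theorem proves the corollary. ■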
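The twisted distribution used in the proof can be computed by a one-dimensional search. The sketch below bisects on the exponent $\varrho$ until $D(\check{\pi}\|\pi^0) = \eta$, relying on the fact that this divergence increases from $0$ to $D(\pi^1\|\pi^0)$ as $\varrho$ runs over $[0, 1]$. It then reports $D(\check{\pi}\|\pi^1)$, which is taken here to be $\beta^*(\eta)$ on the assumption that $\check{\pi}$ lies on the boundary of both divergence balls, as in Figure 2. The distributions are hypothetical.

```python
import math

def kl(nu, mu):
    """Relative entropy D(nu || mu) in nats."""
    return sum(p * math.log(p / q) for p, q in zip(nu, mu) if p > 0)

def geometric_mixture(pi0, pi1, rho):
    """kappa * pi0^(1 - rho) * pi1^rho, normalized to a probability vector."""
    weights = [a ** (1.0 - rho) * b ** rho for a, b in zip(pi0, pi1)]
    total = sum(weights)
    return [w / total for w in weights]

def optimal_exponent(pi0, pi1, eta, tol=1e-12):
    """Bisect on rho so that D(mixture || pi0) = eta; return (rho, D(mixture || pi1))."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl(geometric_mixture(pi0, pi1, mid), pi0) < eta:
            lo = mid
        else:
            hi = mid
    rho = 0.5 * (lo + hi)
    return rho, kl(geometric_mixture(pi0, pi1, rho), pi1)

pi0 = [0.5, 0.3, 0.2]
pi1 = [0.2, 0.3, 0.5]
eta = 0.5 * kl(pi1, pi0)  # any value in (0, D(pi1 || pi0)) is admissible
rho, beta_star = optimal_exponent(pi0, pi1, eta)
print(rho, beta_star)
```

Raising the false-alarm constraint $\eta$ moves $\check{\pi}$ toward $\pi^1$ and lowers the achievable missed-detection exponent.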

The corollary indicates that while using the mismatched test in practice, the function class might be
chosen to include approximations to scaled versions of the log-likelihood ratio functions of the anticipated
alternate distributions $\{\pi^1\}$ with respect to $\pi^0$.

The mismatched divergence has several equivalent characterizations. We first relate it to an ML estimate 
from a parametric family of distributions. 

D. Mismatched divergence and ML estimation 

On interpreting $f_r - \Lambda_\pi(f_r)$ as a log-likelihood ratio function we obtain in Proposition II.5 the following
representation of mismatched divergence,

$$D^{\mathrm{MM}}(\mu\|\pi) = \sup_{r \in \mathbb{R}^d} \big( \mu(f_r) - \Lambda_\pi(f_r) \big) = D(\mu\|\pi) - \inf_{\nu \in \mathcal{E}_\pi} D(\mu\|\nu). \tag{25}$$

The infimum on the RHS of (25) is known as reverse I-projection [?]. Proposition II.6 that follows uses
this representation to obtain other interpretations of the mismatched test.
Proposition II.5. The identity (25) holds for any function class $\mathcal{F}$. The supremum is achieved by some
$r^* \in \mathbb{R}^d$ if and only if the infimum is attained at $\nu^* = \check{\pi}_{r^*} \in \mathcal{E}_\pi$. If a minimizer $\nu^*$ exists, we obtain the
generalized Pythagorean identity,

$$D(\mu\|\pi) = D^{\mathrm{MM}}(\mu\|\pi) + D(\mu\|\nu^*).$$

Proof: For any $r$ we have $\mu(f_r) - \Lambda_\pi(f_r) = \mu(\log(\check{\pi}_r/\pi))$. Consequently,

$$\begin{aligned}
D^{\mathrm{MM}}(\mu\|\pi) &= \sup_{r} \big( \mu(f_r) - \Lambda_\pi(f_r) \big) \\
&= \sup_{r} \mu\Big( \log\Big( \frac{\mu}{\pi}\,\frac{\check{\pi}_r}{\mu} \Big) \Big) \\
&= \sup_{r} \{ D(\mu\|\pi) - D(\mu\|\check{\pi}_r) \}.
\end{aligned}$$

This proves the identity (25), and the remaining conclusions follow directly. ■
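For a one-parameter class $f_r = r\psi$ the optimizer can be found by bisection, because the mean of $\psi$ under $\check{\pi}_r$ is increasing in $r$ and the optimal $r^*$ matches it to $\mu(\psi)$. The sketch below uses hypothetical distributions and a hypothetical function `psi`, and checks the generalized Pythagorean identity numerically.

```python
import math

def kl(nu, mu):
    """Relative entropy D(nu || mu) in nats."""
    return sum(p * math.log(p / q) for p, q in zip(nu, mu) if p > 0)

def twist(pi, psi, r):
    """Twisted distribution pi * exp(r * psi - Lambda_pi(r * psi))."""
    weights = [p * math.exp(r * v) for p, v in zip(pi, psi)]
    total = sum(weights)
    return [w / total for w in weights]

def mean(nu, psi):
    return sum(p * v for p, v in zip(nu, psi))

def solve_r(mu, pi, psi, lo=-50.0, hi=50.0, tol=1e-13):
    """Bisect on r until the twisted mean of psi matches the mean under mu."""
    target = mean(mu, psi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(twist(pi, psi, mid), psi) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

pi = [0.4, 0.3, 0.2, 0.1]
mu = [0.1, 0.2, 0.3, 0.4]
psi = [0.0, 1.0, 4.0, 9.0]  # hypothetical basis function

r_star = solve_r(mu, pi, psi)
nu_star = twist(pi, psi, r_star)
log_mgf = math.log(sum(p * math.exp(r_star * v) for p, v in zip(pi, psi)))
d_mm = r_star * mean(mu, psi) - log_mgf  # mismatched divergence D^MM(mu || pi)

# Left: D(mu || pi). Right: D^MM(mu || pi) + D(mu || nu*). The two agree.
print(kl(mu, pi), d_mm + kl(mu, nu_star))
```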
The representation of Proposition II.5 invites the interpretation of the optimizer in the definition of
the mismatched test statistic in terms of an ML estimate. Given the well-known correspondence between
maximum-likelihood estimation and the generalized likelihood ratio test (GLRT), Proposition II.6 implies
that the mismatched test is a special case of the GLRT analyzed in [?].

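Proposition II.6. Suppose that the observations $\boldsymbol{Z}$ are modeled as an i.i.d. sequence, with marginal in
the family $\mathcal{E}_\pi$. Let $\hat{r}^n$ denote the ML estimate of $r$ based on the first $n$ samples,

$$\hat{r}^n \in \arg\max_{r \in \mathbb{R}^d} \mathsf{P}_{\check{\pi}_r}\{ Z_1 = a_1, Z_2 = a_2, \dots, Z_n = a_n \} = \arg\max_{r \in \mathbb{R}^d} \prod_{i=1}^{n} \check{\pi}_r(a_i)$$

where $a_i$ indicates the observed value of the $i$-th symbol. Assuming the maximum is attained we have
the following interpretations:

(i) The distribution $\check{\pi}_{\hat{r}^n}$ solves the reverse I-projection problem,

$$\check{\pi}_{\hat{r}^n} \in \arg\min_{\nu \in \mathcal{E}_\pi} D(\Gamma^n\|\nu).$$

(ii) The function $f^* = f_{\hat{r}^n}$ achieves the supremum that defines the mismatched divergence,
$D^{\mathrm{MM}}(\Gamma^n\|\pi) = \Gamma^n(f^*) - \Lambda_\pi(f^*)$.

Proof: The ML estimate can be expressed $\hat{r}^n = \arg\max_{r \in \mathbb{R}^d} \langle \Gamma^n, \log\check{\pi}_r \rangle$, and hence (i) follows by
the identity,

$$\arg\min_{\nu \in \mathcal{E}} D(\Gamma^n\|\nu) = \arg\max_{\nu \in \mathcal{E}} \langle \Gamma^n, \log\nu \rangle, \qquad \mathcal{E} \subset \mathcal{P}(\mathsf{Z}).$$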
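Combining the result of part (i) with Proposition II.5 we get the result of part (ii).
From conclusions of Proposition II.5 and Proposition II.6 we have,

$$\begin{aligned}
D^{\mathrm{MM}}(\Gamma^n\|\pi) &= \Big\langle \Gamma^n, \log\frac{\check{\pi}_{\hat{r}^n}}{\pi} \Big\rangle \\
&= \max_{\nu \in \mathcal{E}_\pi} \Big\langle \Gamma^n, \log\frac{\nu}{\pi} \Big\rangle \\
&= \max_{\nu \in \mathcal{E}_\pi} \frac{1}{n}\sum_{i=1}^{n} \log\frac{\nu(Z_i)}{\pi(Z_i)}.
\end{aligned}$$

In general when the supremum in the definition of $D^{\mathrm{MM}}(\Gamma^n\|\pi)$ may not be achieved, the maxima in the
above equations are replaced with suprema and we have the following identity:

$$D^{\mathrm{MM}}(\Gamma^n\|\pi) = \sup_{\nu \in \mathcal{E}_\pi} \frac{1}{n}\sum_{i=1}^{n} \log\frac{\nu(Z_i)}{\pi(Z_i)}.$$

Thus the test statistic used in the mismatched test of (19) is exactly the generalized likelihood ratio between
the family of distributions $\mathcal{E}_{\pi^0}$ and $\pi^0$ where

$$\mathcal{E}_{\pi^0} = \{ \pi^0 \exp(f_r - \Lambda_{\pi^0}(f_r)) : r \in \mathbb{R}^d \}.$$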

More structure can be established when the function class is linear. 



E. Linear function class and I-projection 

The mismatched divergence introduced in [6] was restricted to a linear function class. Let
$\{\psi_i : 1 \le i \le d\}$ denote $d$ functions on $\mathsf{Z}$, let $\psi = (\psi_1, \dots, \psi_d)^T$, and let $f_r = r^T\psi$ in the definition (9):

$$f_r = \sum_{i=1}^{d} r_i\psi_i, \qquad r \in \mathbb{R}^d. \tag{26}$$

A linear function class is particularly appealing because the optimization problem in (8) used to define the
mismatched divergence becomes a convex program and hence is easy to evaluate in practice. Furthermore,
for such a linear function class, the collection of twisted distributions $\mathcal{E}_\pi$ defined in (11) forms an
exponential family of distributions.
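Because the objective is concave in $r$ for a linear function class, even plain gradient ascent evaluates the mismatched divergence. The sketch below is one simple way to do this, not the authors' algorithm: the gradient with respect to $r_i$ is $\mu(\psi_i) - \check{\pi}_r(\psi_i)$, and the step size, iteration count, distributions, and basis functions are all illustrative assumptions.

```python
import math

def twist(pi, psi, r):
    """Twisted distribution for f_r = sum_i r[i] * psi[i], and Lambda_pi(f_r)."""
    f = [sum(ri * psi_i[z] for ri, psi_i in zip(r, psi)) for z in range(len(pi))]
    weights = [p * math.exp(fz) for p, fz in zip(pi, f)]
    total = sum(weights)
    return [w / total for w in weights], math.log(total)

def mismatched_divergence(mu, pi, psi, step=0.5, iters=5000):
    """Maximize r -> mu(f_r) - Lambda_pi(f_r) by fixed-step gradient ascent."""
    d = len(psi)
    mu_psi = [sum(m * v for m, v in zip(mu, psi_i)) for psi_i in psi]
    r = [0.0] * d
    for _ in range(iters):
        tw, _ = twist(pi, psi, r)
        grad = [mu_psi[i] - sum(t * v for t, v in zip(tw, psi[i])) for i in range(d)]
        r = [ri + step * g for ri, g in zip(r, grad)]
    _, log_mgf = twist(pi, psi, r)
    value = sum(ri * m for ri, m in zip(r, mu_psi)) - log_mgf
    return value, r

pi = [0.25, 0.25, 0.25, 0.25]
mu = [0.10, 0.20, 0.30, 0.40]
psi = [[0.0, 1.0, 2.0, 3.0],   # hypothetical basis function psi_1
       [1.0, 0.0, 0.0, 1.0]]   # hypothetical basis function psi_2

value, r_hat = mismatched_divergence(mu, pi, psi)
print(value, r_hat)
```

At convergence the twisted distribution reproduces the moments of $\psi$ under $\mu$, and the returned value is a lower bound on $D(\mu\|\pi)$. A fixed step size is only safe when it is small relative to the spread of the basis functions; a line search or Newton step would be more robust in general.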

Proposition II.5 expresses $D^{\mathrm{MM}}(\mu\|\pi)$ as a difference between the ordinary divergence and the value
of a reverse I-projection $\inf_{\nu \in \mathcal{E}_\pi} D(\mu\|\nu)$. The next result establishes a characterization in terms of a
(forward) I-projection. For a given vector $c \in \mathbb{R}^d$ we let $\mathbb{P}$ denote the moment class

$$\mathbb{P} = \{ \nu \in \mathcal{P}(\mathsf{Z}) : \nu(\psi) = c \} \tag{27}$$

where $\nu(\psi) = (\nu(\psi_1), \nu(\psi_2), \dots, \nu(\psi_d))^T$.
Proposition II.7. Suppose that the supremum in the definition of $D^{\mathrm{MM}}(\mu\|\pi)$ is achieved at some $r^* \in \mathbb{R}^d$.
Then,

(i) The distribution $\nu^* := \check{\pi}_{r^*} \in \mathcal{E}_\pi$ satisfies,

$$D^{\mathrm{MM}}(\mu\|\pi) = D(\nu^*\|\pi) = \min\{ D(\nu\|\pi) : \nu \in \mathbb{P} \},$$

where $\mathbb{P}$ is defined using $c = \mu(\psi)$ in (27).
(ii) $D^{\mathrm{MM}}(\mu\|\pi) = \min\{ D(\nu\|\pi) : \nu \in \mathcal{H}_{g^*} \}$, where $g^*$ is given in (22) with $f = r^{*T}\psi$, and
$\eta = D^{\mathrm{MM}}(\mu\|\pi)$.

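Proof: Since the supremum is achieved, the gradient must vanish by the first order condition for
optimality:

$$\nabla\big( \mu(f_r) - \Lambda_\pi(f_r) \big)\Big|_{r = r^*} = 0.$$

The gradient is computable, and the identity above can thus be expressed $\mu(\psi) - \check{\pi}_{r^*}(\psi) = 0$. That is,
the first order condition for optimality is equivalent to the constraint $\check{\pi}_{r^*} \in \mathbb{P}$. Consequently,

$$\begin{aligned}
D(\nu^*\|\pi) &= \Big\langle \check{\pi}_{r^*}, \log\frac{\check{\pi}_{r^*}}{\pi} \Big\rangle \\
&= \check{\pi}_{r^*}(r^{*T}\psi) - \Lambda_\pi(r^{*T}\psi).
\end{aligned}$$

Since $\check{\pi}_{r^*}(\psi) = \mu(\psi)$, the right-hand side equals $\mu(r^{*T}\psi) - \Lambda_\pi(r^{*T}\psi) = D^{\mathrm{MM}}(\mu\|\pi)$.

Furthermore, by the convexity of $\Lambda_\pi(f_r)$ in $r$, it follows that the optimal $r^*$ in the definition of $D^{\mathrm{MM}}(\nu\|\pi)$
is the same for all $\nu \in \mathbb{P}$. Hence, it follows by the Pythagorean equality of Proposition II.5 that

$$D(\nu\|\pi) = D(\nu\|\nu^*) + D(\nu^*\|\pi), \quad \text{for all } \nu \in \mathbb{P}.$$

Minimizing over $\nu \in \mathbb{P}$ it follows that $\nu^*$ is the I-projection of $\pi$ onto $\mathbb{P}$:

$$D(\nu^*\|\pi) = \min\{ D(\nu\|\pi) : \nu \in \mathbb{P} \}$$

which gives (i).

To establish (ii), note first that by (i) and the inclusion $\mathbb{P} \subset \mathcal{H}_{g^*}$ we have,

$$\begin{aligned}
D^{\mathrm{MM}}(\mu\|\pi) &= \min\{ D(\nu\|\pi) : \nu \in \mathbb{P} \} \\
&\ge \inf\{ D(\nu\|\pi) : \nu \in \mathcal{H}_{g^*} \}.
\end{aligned}$$

The reverse inequality follows from Proposition II.4 (i), and moreover the infimum is achieved with $\nu^*$.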
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Fig. 3. Interpretations of the mismatched divergence for a linear function class. The distribution 7f r is the I-projection of it 
onto a hyperplane 'rig*. It is also the reverse I-projection of onto the exponential family £ w . 

The geometry underlying mismatched divergence for a linear function class is illustrated in Figure 3. Suppose that the assumptions of Proposition II.7 hold, so that the supremum in (25) is achieved at $r^*$. Let $\eta = D^{\mathrm{MM}}(\mu\|\pi) = \mu(f_{r^*}) - \Lambda_\pi(f_{r^*})$, and $g^* = f_{r^*} - (\eta + \Lambda_\pi(f_{r^*}))$. Proposition II.4 implies that $\mathcal{H}_{g^*}$ defines a hyperplane passing through $\mu$, with $\mathcal{Q}_\eta(\pi) \subset \mathcal{Q}^{\mathrm{MM}}_\eta(\pi) \subset \mathcal{H}^-_{g^*}$. This is strengthened in the linear case by Proposition II.7, which states that $\mathcal{H}_{g^*}$ supports $\mathcal{Q}_\eta(\pi)$ at the distribution $\check\pi_{r^*}$. Furthermore, Proposition II.5 asserts that the distribution $\check\pi_{r^*}$ minimizes $D(\mu\|\check\pi)$ over all $\check\pi \in \mathcal{E}_\pi$.

The result established in Corollary II.1 along with the interpretation of the mismatched test as a GLRT
can be used to show that the GLRT is asymptotically optimal for an exponential family of distributions.

Theorem II.8. Let $\pi^0$ be some probability distribution over a finite set $\mathsf{Z}$. Let $\mathcal{F}$ be a linear function class as defined in (26) and $\mathcal{E}_{\pi^0}$ be the associated exponential family of distributions defined in (17). Consider the generalized likelihood ratio test (GLRT) between $\pi^0$ and $\mathcal{E}_{\pi^0}$ defined by the following sequence of decision rules:




The GLRT solves the composite hypothesis testing problem (24) for all $\pi^1 \in \mathcal{E}_{\pi^0}$ in the sense that the constraint $J^0_{\phi^{\mathrm{GLRT}}} \ge \eta$ is satisfied with equality, and under $\pi^1$ the optimal error exponent $\beta^*(\eta)$ is achieved for all $\eta \in (0, D(\pi^1\|\pi^0))$ and for all $\pi^1 \in \mathcal{E}_{\pi^0}$; i.e., $J^1_{\phi^{\mathrm{GLRT}}} = \beta^*(\eta)$.

Proof: From Proposition II.6 and the discussion following the proposition, we know that $\phi^{\mathrm{GLRT}}$ is the same as the mismatched test defined with respect to the function class $\mathcal{F}$. Moreover, any distribution $\pi^1 \in \mathcal{E}_{\pi^0}$ is of the form $\check\pi_r = \pi^0 \exp(f_r - \Lambda_{\pi^0}(f_r))$ for some $r \in \mathbb{R}^d$ as defined in (10). Using $L$ to denote the log-likelihood ratio function between $\pi^1$ and $\pi^0$, it follows by the linearity of $\mathcal{F}$ that for any $\alpha > 0$,

$$\alpha L = \alpha (f_r - \Lambda_{\pi^0}(f_r)) = f_{\alpha r} + \tau$$

for some $\tau \in \mathbb{R}$. Hence, it follows by the conclusion of Corollary II.1 that the GLRT $\phi^{\mathrm{GLRT}}$ solves the composite hypothesis testing problem (24) between $\pi^0$ and $\mathcal{E}_{\pi^0}$. ■
The above result is a special case of the sufficient conditions for optimality of the GLRT established in
[3, Thm 2, p. 1600]. From the proof it is easily seen that the result can be extended to hold for composite
hypothesis tests between $\pi^0$ and any family of distributions $\mathcal{E}_{\pi^0}$ of the form in (11) provided $\mathcal{F}$ is closed
under positive scaling. It is also possible to strengthen the result of Corollary II.1 to obtain an alternate
proof of (3j Thm 2, p. 1600]. We refer the reader to EDI for details. 

F. Log-linear function class and robust hypothesis testing 

In the prior work [8], [9] the following relaxation of entropy is considered,

$$D^{\mathrm{ROB}}(\mu\|\pi) := \inf_{\nu \in \mathbb{P}} D(\mu\|\nu) \qquad (28)$$

where the moment class $\mathbb{P}$ is defined in (27) with $c = \pi(\psi)$, for a given collection of functions $\{\psi_i : 1 \le i \le d\}$. The associated universal test solves a min-max robust hypothesis testing problem.

We show here that $D^{\mathrm{ROB}}$ coincides with $D^{\mathrm{MM}}$ for a particular function class. It is described as in (9), where each function $f_r$ is of the log-linear form,

$$f_r = \log(1 + r^T\psi)$$

subject to the constraint that $1 + r^T\psi(z)$ is strictly positive for each $z$. We further require that the functions $\psi$ have zero mean under the distribution $\pi$, i.e., we require $\pi(\psi) = 0$.

Proposition II.9. For a given $\pi \in \mathcal{P}(\mathsf{Z})$, suppose that the log-linear function class $\mathcal{F}$ is chosen with functions $\{\psi_i\}$ satisfying $\pi(\psi) = 0$. Suppose that the moment class used in the definition of $D^{\mathrm{ROB}}$ is chosen consistently, with $c = 0$ in (27). We then have for each $\mu \in \mathcal{P}(\mathsf{Z})$,

$$D^{\mathrm{MM}}(\mu\|\pi) = D^{\mathrm{ROB}}(\mu\|\pi)$$

Proof: For each $\mu \in \mathcal{P}(\mathsf{Z})$, we obtain the following identity by applying Theorem 1.4 in [9],

$$\inf_{\nu \in \mathbb{P}} D(\mu\|\nu) = \sup\{ \mu(\log(1 + r^T\psi)) : 1 + r^T\psi(z) > 0 \text{ for all } z \in \mathsf{Z} \}$$

Moreover, under the assumption that $\pi(\psi) = 0$ we obtain,

$$\Lambda_\pi(\log(1 + r^T\psi)) = \log(\pi(1 + r^T\psi)) = 0$$

Combining these identities gives,

$$D^{\mathrm{ROB}}(\mu\|\pi) := \inf_{\nu \in \mathbb{P}} D(\mu\|\nu) = \sup\{ \mu(\log(1 + r^T\psi)) - \Lambda_\pi(\log(1 + r^T\psi)) : 1 + r^T\psi(z) > 0 \text{ for all } z \in \mathsf{Z} \} = \sup_{f \in \mathcal{F}} \{ \mu(f) - \Lambda_\pi(f) \} = D^{\mathrm{MM}}(\mu\|\pi)$$

■
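The step in the proof stating that the log moment generating function vanishes on the log-linear class when $\pi(\psi) = 0$ can be checked numerically. The following is a minimal sketch under stated assumptions: the alphabet size, the randomly drawn basis functions, and the helper name `log_mgf` are illustrative choices and do not come from the paper.

```python
import numpy as np

def log_mgf(pi, f):
    """Log moment generating function: Lambda_pi(f) = log pi(exp(f))."""
    return np.log(np.sum(pi * np.exp(f)))

rng = np.random.default_rng(0)
N, d = 6, 2                                  # illustrative alphabet size and dimension
pi = rng.dirichlet(np.ones(N))               # a distribution with full support
psi = rng.standard_normal((d, N))            # hypothetical basis functions, one per row
psi -= (psi @ pi)[:, None]                   # center each row so that pi(psi) = 0

# Scale r so that 1 + r^T psi(z) > 0 for every z
r = rng.standard_normal(d)
r *= 0.5 / np.max(np.abs(r @ psi))
f_r = np.log1p(r @ psi)                      # log-linear function f_r = log(1 + r^T psi)

lam = log_mgf(pi, f_r)                       # log(pi(1 + r^T psi)) = log(1)
```

Because the centering enforces $\pi(\psi) = 0$, the quantity `lam` should vanish up to floating-point error for any admissible `r`.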

III. Asymptotic Statistics 

In this section, we analyze the asymptotic statistics of the mismatched test. We require some assumptions regarding the function class $\mathcal{F} = \{f_r : r \in \mathbb{R}^d\}$ to establish these results. Note that the second and third assumptions given below involve a distribution $\mu^0 \in \mathcal{P}(\mathsf{Z})$, and a vector $s \in \mathbb{R}^d$. We will make specialized versions of these assumptions in establishing our results, based on specific values of $\mu^0$ and $s$. We use $\mathsf{Z}_{\mu^0} \subset \mathsf{Z}$ to denote the support of $\mu^0$ and $\mathcal{P}(\mathsf{Z}_{\mu^0})$ to denote the space of probability measures supported on $\mathsf{Z}_{\mu^0}$, viewed as a subset of $\mathcal{P}(\mathsf{Z})$.

Assumptions 

(A1) $f_r(z)$ is $C^2$ in $r$ for each $z \in \mathsf{Z}$.

(A2) There exists a neighborhood $B$ of $\mu^0$, open in $\mathcal{P}(\mathsf{Z}_{\mu^0})$, such that for each $\mu \in B$, the supremum in the definition of $D^{\mathrm{MM}}(\mu\|\pi^0)$ in (25) is achieved at a unique point $r(\mu)$.

(A3) The vectors $\{\psi_0, \ldots, \psi_d\}$ are linearly independent over the support of $\mu^0$, where $\psi_0 \equiv 1$, and for each $i \ge 1$

$$\psi_i(z) = \frac{\partial}{\partial r_i} f_r(z) \Big|_{r = s}, \quad z \in \mathsf{Z}. \qquad (29)$$

The linear-independence assumption in (A3) is defined as follows: If there are constants $\{a_0, \ldots, a_d\}$ satisfying $\sum_{i=0}^{d} a_i \psi_i(z) = 0$ a.e. $[\mu^0]$, then $a_i = 0$ for each $i$. In the case of a linear function class, the functions $\{\psi_i, i \ge 1\}$ defined in (29) are just the basis functions in (26). Lemma III.1 provides an alternate characterization of Assumption (A3).

For any $\mu \in \mathcal{P}(\mathsf{Z})$ define the covariance matrix $\Sigma_\mu$ via,

$$\Sigma_\mu(i,j) = \mu(\psi_i\psi_j) - \mu(\psi_i)\mu(\psi_j), \quad 1 \le i, j \le d. \qquad (30)$$

We use $\mathrm{Cov}_\mu(g)$ to denote the covariance of an arbitrary real-valued function $g$ under $\mu$:

$$\mathrm{Cov}_\mu(g) := \mu(g^2) - \mu(g)^2. \qquad (31)$$

Lemma III.1. Assumption (A3) holds if and only if $\Sigma_{\mu^0} > 0$.

Proof: We evidently have $v^T\Sigma_{\mu^0}v = \mathrm{Cov}_{\mu^0}(v^T\psi) \ge 0$ for any vector $v \in \mathbb{R}^d$. Hence, we have the following equivalence: For any $v \in \mathbb{R}^d$, on denoting $c_v = \mu^0(v^T\psi)$,

$$v^T\Sigma_{\mu^0}v = 0 \iff \sum_{i=1}^{d} v_i\psi_i(z) = c_v \quad \text{a.e. } [\mu^0]$$

The conclusion of the lemma follows. ■
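The equivalence in Lemma III.1 can be illustrated on a small example. This is a sketch only: the distribution `mu`, the two sets of basis functions, and the helper name `covariance_matrix` are hypothetical and chosen for illustration.

```python
import numpy as np

def covariance_matrix(mu, psi):
    """Sigma_mu(i, j) = mu(psi_i psi_j) - mu(psi_i) mu(psi_j), as in (30)."""
    mean = psi @ mu
    return (psi * mu) @ psi.T - np.outer(mean, mean)

mu = np.array([0.1, 0.2, 0.3, 0.4])          # full support on four points

# Independent case: {1, psi_1, psi_2} are linearly independent on the support
psi_indep = np.array([[0.0, 1.0, 2.0, 3.0],
                      [0.0, 1.0, 4.0, 9.0]])
# Dependent case: psi_2 = 2 * psi_1 + 5, so a combination of psi_1, psi_2 is constant
psi_dep = np.array([[0.0, 1.0, 2.0, 3.0],
                    [5.0, 7.0, 9.0, 11.0]])

min_eig_indep = np.linalg.eigvalsh(covariance_matrix(mu, psi_indep)).min()
min_eig_dep = np.linalg.eigvalsh(covariance_matrix(mu, psi_dep)).min()
```

In the first case the smallest eigenvalue is strictly positive, so (A3) holds; in the second case it is zero up to rounding, reflecting the affine dependence among the basis functions.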
We now present our main asymptotic results. Theorem III.2 identifies the asymptotic bias and variance
of the mismatched test statistic under the null hypothesis, and also under the alternate hypothesis. A key
observation is that the asymptotic bias and variance do not depend on $N$, the cardinality of $\mathsf{Z}$.

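Theorem III.2. Suppose that the observation sequence $Z$ is i.i.d. with marginal $\pi$. Suppose that there exists $r^*$ satisfying $f_{r^*} = \log(\pi/\pi^0)$. Further, suppose that Assumptions (A1), (A2), (A3) hold with $\mu^0 = \pi$ and $s = r^*$. Then,

(i) When $\pi = \pi^0$,

$$\lim_{n\to\infty} \mathsf{E}[n D^{\mathrm{MM}}(\Gamma^n\|\pi^0)] = \tfrac{1}{2} d \qquad (32)$$

$$\lim_{n\to\infty} \mathrm{Var}[n D^{\mathrm{MM}}(\Gamma^n\|\pi^0)] = \tfrac{1}{2} d \qquad (33)$$

$$n D^{\mathrm{MM}}(\Gamma^n\|\pi^0) \xrightarrow[n\to\infty]{d.} \tfrac{1}{2}\chi^2_d$$

(ii) When $\pi = \pi^1 \ne \pi^0$, we have with $\sigma_1^2 := \mathrm{Cov}_{\pi^1}(f_{r^*})$,

$$\lim_{n\to\infty} \mathsf{E}[n(D^{\mathrm{MM}}(\Gamma^n\|\pi^0) - D(\pi^1\|\pi^0))] = \tfrac{1}{2} d \qquad (34)$$

$$\lim_{n\to\infty} \mathrm{Var}[n^{1/2} D^{\mathrm{MM}}(\Gamma^n\|\pi^0)] = \sigma_1^2 \qquad (35)$$

$$n^{1/2}(D^{\mathrm{MM}}(\Gamma^n\|\pi^0) - D(\pi^1\|\pi^0)) \xrightarrow[n\to\infty]{d.} N(0, \sigma_1^2). \qquad (36)$$

□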
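Part (i) of the theorem can be explored by simulation for a linear function class. The sketch below is illustrative only: the alphabet size, dimension, sample size, number of trials, and random basis functions are assumptions, and the damped Newton solver (`mismatched_divergence`, a hypothetical helper) is just one convenient way to evaluate the supremum, not a method taken from the paper.

```python
import numpy as np

def objective(r, mu, pi0, psi):
    """mu(f_r) - Lambda_{pi0}(f_r) for the linear class f_r = r^T psi."""
    f = r @ psi
    return mu @ f - np.log(np.sum(pi0 * np.exp(f)))

def mismatched_divergence(mu, pi0, psi, iters=50, tol=1e-12):
    """Maximize the concave objective over r with a damped Newton iteration."""
    r = np.zeros(psi.shape[0])
    for _ in range(iters):
        w = pi0 * np.exp(r @ psi)
        w /= w.sum()                                    # twisted distribution
        mean = psi @ w
        grad = psi @ mu - mean                          # mu(psi) - twisted(psi)
        cov = (psi * w) @ psi.T - np.outer(mean, mean)  # negative of the Hessian
        step = np.linalg.solve(cov, grad)
        t, base = 1.0, objective(r, mu, pi0, psi)
        while objective(r + t * step, mu, pi0, psi) < base and t > 1e-8:
            t *= 0.5                                    # backtrack if the objective decreases
        r = r + t * step
        if np.max(np.abs(step)) < tol:
            break
    return objective(r, mu, pi0, psi)

rng = np.random.default_rng(1)
N, d, n, trials = 8, 3, 400, 2000                       # illustrative sizes
pi0 = np.full(N, 1.0 / N)                               # uniform null distribution
psi = rng.standard_normal((d, N))                       # random basis functions

# n * D_MM(Gamma^n || pi0) for independent samples drawn under pi0
stats = np.array([
    n * mismatched_divergence(rng.multinomial(n, pi0) / n, pi0, psi)
    for _ in range(trials)
])
mean_stat, var_stat = stats.mean(), stats.var()         # compare both with d / 2
```

Under these assumptions, both the empirical mean and the empirical variance of the scaled statistic should be close to $d/2$, independent of the alphabet size $N$.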



In part (ii) of Theorem III.2, the assumption that $r^*$ exists implies that $\pi^1$ and $\pi^0$ have equal supports. Furthermore, if Assumption (A3) holds in part (ii), then a sufficient condition for Assumption (A2) is that the function $V(r) := (-\pi^1(f_r) + \Lambda_{\pi^0}(f_r))$ be coercive in $r$. And, under (A3), the function $V$ is strictly convex and coercive in the following settings: (i) if the function class is linear, or (ii) the function class is log-linear, and the two distributions $\pi^1$ and $\pi^0$ have common support. We use this fact in Theorem III.3 for the linear function class. The assumption of the existence of $r^*$ satisfying $f_{r^*} = \log(\pi^1/\pi^0)$ in part (ii) of Theorem III.2 can be relaxed. In the case of a linear function class we have the following extension of part (ii).

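Theorem III.3. Suppose that the observation sequence $Z$ is drawn i.i.d. with marginal $\pi^1$ satisfying $\pi^1 \prec \pi^0$. Let $\mathcal{F}$ be the linear function class defined in (26). Suppose the supremum in the definition of $D^{\mathrm{MM}}(\pi^1\|\pi^0)$ is achieved at some $r^1 \in \mathbb{R}^d$. Further, suppose that the functions $\{\psi_i\}$ satisfy the linear independence condition of Assumption (A3) with $\mu^0 = \pi^1$. Then we have,

$$\lim_{n\to\infty} \mathsf{E}[n(D^{\mathrm{MM}}(\Gamma^n\|\pi^0) - D^{\mathrm{MM}}(\pi^1\|\pi^0))] = \tfrac{1}{2}\mathrm{trace}(\Sigma_{\pi^1}\Sigma_{\check\pi}^{-1})$$

$$\lim_{n\to\infty} \mathrm{Var}[n^{1/2} D^{\mathrm{MM}}(\Gamma^n\|\pi^0)] = \sigma_1^2$$

$$n^{1/2}(D^{\mathrm{MM}}(\Gamma^n\|\pi^0) - D^{\mathrm{MM}}(\pi^1\|\pi^0)) \xrightarrow[n\to\infty]{d.} N(0, \sigma_1^2)$$

where in the first limit $\check\pi = \pi^0\exp(f_{r^1} - \Lambda_{\pi^0}(f_{r^1}))$, and $\Sigma_{\pi^1}$ and $\Sigma_{\check\pi}$ are defined as in (30). In the second two limits $\sigma_1^2 = \mathrm{Cov}_{\pi^1}(f_{r^1})$. □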

Although we have not explicitly imposed Assumption (A2) in Theorem III.3, the argument we presented following Theorem III.2 ensures that when $\pi^1 \prec \pi^0$, Assumption (A2) is satisfied whenever Assumption (A3) holds. Furthermore, it can be shown that the achievement of the supremum required in Theorem III.3 is guaranteed if $\pi^1$ and $\pi^0$ have equal supports. We also note that the vector $s$ appearing in eq. (29) of Assumption (A3) is arbitrary when the parametrization of the function class is linear.

The weak convergence results in Theorem III.2 (i) can be derived from Clarke and Barron [12], [13] (see
also [7, Theorem 4.2]), following the maximum-likelihood estimation interpretation of the mismatched
test obtained in Proposition II.6. In the statistics literature, such results are called Wilks phenomena after
the initial work by Wilks [14].

These results can be used to set thresholds for a target false alarm probability in the mismatched test,
just like we did for the Hoeffding test in (15). It is shown in [21] that such results can be used to set
thresholds for the robust hypothesis testing problem described in Section II-F.
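As a rough illustration of the threshold selection described above, the chi-squared limit in Theorem III.2 (i) suggests choosing $\eta \approx q_{1-\alpha}/(2n)$, where $q_{1-\alpha}$ is the $(1-\alpha)$-quantile of a $\chi^2_d$ random variable. The sketch below is an assumption-laden example: the function name `mismatched_threshold` is hypothetical, and the quantile is estimated by sampling only to keep the example dependent on NumPy alone.

```python
import numpy as np

def mismatched_threshold(alpha, d, n, num_samples=200_000, seed=0):
    """Approximate eta with P{ D_MM(Gamma^n || pi0) >= eta } ~= alpha under pi0.

    Uses the large-n approximation n * D_MM ~ (1/2) * chi-squared(d); the
    chi-squared quantile is estimated from random samples.
    """
    rng = np.random.default_rng(seed)
    q = np.quantile(rng.chisquare(d, size=num_samples), 1.0 - alpha)
    return q / (2.0 * n)

# Example: target false alarm probability 0.05, d = 2, n = 100 observations
eta = mismatched_threshold(alpha=0.05, d=2, n=100)
```

For $d = 2$ the quantile has the closed form $-2\log\alpha$, which gives a simple sanity check on the estimate. The approximation is only as good as the large-$n$ limit on which it rests.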

Implications for Hoeffding test: The divergence can be interpreted as a special case of mismatched divergence defined with respect to a linear function class. Using this interpretation, the results of Theorem III.2 can also be specialized to obtain results on the Hoeffding test statistic. To satisfy the uniqueness condition of Assumption (A2), we require that the function class should not contain any constant functions. Now suppose that the span of the linear function class $\mathcal{F}$ together with the constant function $f^0 \equiv 1$ spans the set of all functions on $\mathsf{Z}$. This together with Assumption (A3) would imply that $d = N - 1$, where $N$ is the size of the alphabet $\mathsf{Z}$. It follows from Proposition II.1 that for such a function class the mismatched divergence coincides with the divergence. Thus, an application of Theorem III.2 (i) gives rise to the
results stated in Theorem I.2.

To prove Theorem III.2 and Theorem III.3 we need the following lemmas, whose proofs are given in
the Appendix.

The following lemma will be used to deduce part (ii) of Theorem III.2 from part (i).

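Lemma III.4. Let $D^{\mathrm{MM}}_{\mathcal{F}}$ denote the mismatched divergence defined using function class $\mathcal{F}$. Suppose $\pi^1 \prec \pi^0$ and the supremum in the definition of $D^{\mathrm{MM}}_{\mathcal{F}}(\pi^1\|\pi^0)$ is achieved at some $f_{r^*} \in \mathcal{F}$. Let $\check\pi = \pi^0\exp(f_{r^*} - \Lambda_{\pi^0}(f_{r^*}))$ and $\mathcal{G} = \mathcal{F} - f_{r^*} := \{f_r - f_{r^*} : r \in \mathbb{R}^d\}$. Then for any $\mu$ satisfying $\mu \prec \pi^0$, we have

$$D^{\mathrm{MM}}_{\mathcal{F}}(\mu\|\pi^0) = D^{\mathrm{MM}}_{\mathcal{F}}(\pi^1\|\pi^0) + D^{\mathrm{MM}}_{\mathcal{G}}(\mu\|\check\pi) + \Big\langle \mu - \pi^1, \log\frac{\check\pi}{\pi^0} \Big\rangle. \qquad (37)$$

□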

Suppose we apply the decomposition result from Lemma III.4 to the type of the observation sequence $Z$, assumed to be drawn i.i.d. with marginal $\pi^1$. If there exists $r^*$ satisfying $f_{r^*} = \log(\pi^1/\pi^0)$, then we have $\check\pi = \pi^1$. The decomposition becomes

$$D^{\mathrm{MM}}_{\mathcal{F}}(\Gamma^n\|\pi^0) = D^{\mathrm{MM}}_{\mathcal{F}}(\pi^1\|\pi^0) + D^{\mathrm{MM}}_{\mathcal{G}}(\Gamma^n\|\pi^1) + \langle \Gamma^n - \pi^1, f_{r^*} \rangle. \qquad (38)$$

For large $n$, the second term in the decomposition (38) has a mean of order $n^{-1}$ and variance of order $n^{-2}$, as shown in part (i) of Theorem III.2. The third term has zero mean and variance of order $n^{-1}$, since by the Central Limit Theorem,

$$n^{1/2}\langle \Gamma^n - \pi^1, f_{r^*} \rangle \xrightarrow[n\to\infty]{d.} N(0, \mathrm{Cov}_{\pi^1}(f_{r^*})). \qquad (39)$$

Thus, the asymptotic variance of $D^{\mathrm{MM}}_{\mathcal{F}}(\Gamma^n\|\pi^0)$ is dominated by that of the third term and the asymptotic bias is dominated by that of the second term. Thus we see that part (ii) of Theorem III.2 can be deduced from part (i).

Lemma III.5. Let $X = \{X^i : i = 1, 2, \ldots\}$ be an i.i.d. sequence with mean $\bar{x}$ taking values in a compact convex set $\mathsf{X} \subset \mathbb{R}^m$, containing $\bar{x}$ as a relative interior point. Define $S^n = \frac{1}{n}\sum_{i=1}^{n} X^i$. Suppose we are given a function $h : \mathbb{R}^m \to \mathbb{R}$, together with a compact set $K$ containing $\bar{x}$ as a relative interior point such that,

1) The gradient $\nabla h(x)$ and the Hessian $\nabla^2 h(x)$ are continuous over a neighborhood of $K$.

2) $\lim_{n\to\infty} -\frac{1}{n}\log \mathsf{P}\{S^n \notin K\} > 0$.

Let $M = \nabla^2 h(\bar{x})$ and $\Xi = \mathrm{Cov}(X^1)$. Then,

(i) The normalized asymptotic bias of $\{h(S^n) : n \ge 1\}$ is obtained via,

$$\lim_{n\to\infty} n\mathsf{E}[h(S^n) - h(\bar{x})] = \tfrac{1}{2}\mathrm{trace}(M\Xi)$$

(ii) If in addition to the above conditions, the directional derivative satisfies $\nabla h(\bar{x})^T(X^1 - \bar{x}) = 0$ almost surely, then the asymptotic variance decays as $n^{-2}$, with

$$\lim_{n\to\infty} n^2\mathrm{Var}[h(S^n)] = \tfrac{1}{2}\mathrm{trace}(M\Xi M\Xi)$$

□

Lemma III.6. Suppose that the observation sequence $Z$ is drawn i.i.d. with marginal $\mu \in \mathcal{P}(\mathsf{Z})$. Let $h : \mathcal{P}(\mathsf{Z}) \to \mathbb{R}$ be a continuous real-valued function whose gradient and Hessian are continuous in a neighborhood of $\mu$. If the directional derivative satisfies $\nabla h(\mu)^T(\nu - \mu) \equiv 0$ for all $\nu \in \mathcal{P}(\mathsf{Z})$, then

$$n(h(\Gamma^n) - h(\mu)) \xrightarrow[n\to\infty]{d.} \tfrac{1}{2}W^T M W \qquad (40)$$

where $M = \nabla^2 h(\mu)$ and $W \sim N(0, \Sigma_W)$ with $\Sigma_W = \mathrm{diag}(\mu) - \mu\mu^T$. □

Lemma III.7. Suppose that $V$ is an $m$-dimensional, $N(0, I_m)$ random variable, and $D : \mathbb{R}^m \to \mathbb{R}^m$ is a projection matrix. Then $\xi := \|DV\|^2$ is a chi-squared random variable with $K$ degrees of freedom, where $K$ denotes the rank of $D$. □
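Lemma III.7 can be illustrated by simulation. The sketch below assumes an orthogonal (symmetric) projection built from a random subspace; the dimensions and sample count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
m, K = 6, 3                                   # ambient dimension and rank (illustrative)
A = rng.standard_normal((m, K))
Q, _ = np.linalg.qr(A)                        # orthonormal basis of a K-dimensional subspace
D = Q @ Q.T                                   # orthogonal projection matrix of rank K

V = rng.standard_normal((100_000, m))         # rows are independent N(0, I_m) samples
xi = np.sum((V @ D) ** 2, axis=1)             # ||D V||^2 per sample (D is symmetric)

rank_D = np.linalg.matrix_rank(D)
mean_xi, var_xi = xi.mean(), xi.var()         # chi-squared(K) has mean K and variance 2K
```

The empirical mean and variance of `xi` should be close to $K$ and $2K$ respectively, consistent with a chi-squared distribution with $K$ degrees of freedom.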

Before we proceed to the proofs of Theorem III.2 and Theorem III.3 we recall the optimization problem (25) defining the mismatched divergence:

$$D^{\mathrm{MM}}(\mu\|\pi^0) = \sup_{r \in \mathbb{R}^d} \big( \mu(f_r) - \Lambda_{\pi^0}(f_r) \big). \qquad (41)$$

The first order condition for optimality is given by,

$$g(\mu, r) = 0 \qquad (42)$$

where $g$ is the vector valued function that defines the gradient of the objective function in (41):

$$g(\mu, r) := \nabla_r \big( \mu(f_r) - \Lambda_{\pi^0}(f_r) \big) \qquad (43)$$

The derivative of $g(\mu, r)$ with respect to $r$ is given by

$$\nabla_r g(\mu, r) = \mu(\nabla_r^2 f_r) - \left[ \frac{\pi^0(e^{f_r}\nabla_r f_r \nabla_r f_r^T) + \pi^0(e^{f_r}\nabla_r^2 f_r)}{\pi^0(e^{f_r})} - \frac{\pi^0(e^{f_r}\nabla_r f_r)\,\pi^0(e^{f_r}\nabla_r f_r^T)}{(\pi^0(e^{f_r}))^2} \right] \qquad (44)$$

In these formulae we have extended the definition of $\mu(M)$ for matrix-valued functions $M$ on $\mathsf{Z}$ via $[\mu(M)]_{ij} := \mu(M_{ij}) = \sum_z M_{ij}(z)\mu(z)$. On letting $\psi^r = \nabla_r f_r$ we obtain,

$$g(\mu, r) = \mu(\psi^r) - \check\pi_r(\psi^r) \qquad (45)$$

$$\nabla_r g(\mu, r) = \mu(\nabla_r^2 f_r) - \check\pi_r(\nabla_r^2 f_r) - \big[ \check\pi_r(\psi^r\psi^{rT}) - \check\pi_r(\psi^r)\check\pi_r(\psi^{rT}) \big] \qquad (46)$$

where the definition of the twisted distribution is as given in (10):

$$\check\pi_r := \pi^0\exp(f_r - \Lambda_{\pi^0}(f_r)).$$

Proof of Theorem III.2: Without loss of generality, we assume that $\pi^0$ has full support over $\mathsf{Z}$. Suppose that the observation sequence $Z$ is drawn i.i.d. with marginal distribution $\pi \in \mathcal{P}(\mathsf{Z})$. We have $D^{\mathrm{MM}}(\Gamma^n\|\pi^0) \to D^{\mathrm{MM}}(\pi\|\pi^0)$ as $n \to \infty$ by the law of large numbers.

1) Proof of part (i): We first prove the results concerning the bias and variance of the mismatched test statistic. We apply Lemma III.5 to the function $h(\mu) := D^{\mathrm{MM}}(\mu\|\pi^0)$. The other terms appearing in the lemma are taken to be $X^i = (\mathbb{I}_{z_1}(Z_i), \mathbb{I}_{z_2}(Z_i), \ldots, \mathbb{I}_{z_N}(Z_i))^T$, $\mathsf{X} = \mathcal{P}(\mathsf{Z})$, $\bar{x} = \pi^0$, and $S^n = \Gamma^n$. Let $\Xi = \mathrm{Cov}(X^1)$. It is easy to see that $\Xi = \mathrm{diag}(\pi^0) - \pi^0\pi^{0T}$ and $\Sigma_{\pi^0} = \Psi\Xi\Psi^T$, where $\Sigma_{\pi^0}$ is defined in (30), and $\Psi$ is a $d \times N$ matrix defined by,

$$\Psi(i,j) = \psi_i(z_j). \qquad (47)$$

This can be expressed as the concatenation of column vectors via $\Psi = [\psi(z_1), \psi(z_2), \ldots, \psi(z_N)]$.
We first demonstrate that

$$M = \nabla^2 h(\pi^0) = \Psi^T(\Sigma_{\pi^0})^{-1}\Psi, \qquad (48)$$

and then check to make sure that the other requirements of Lemma III.5 are satisfied. The first two conclusions of Theorem III.2 (i) will then follow from Lemma III.5 since

$$\mathrm{trace}(M\Xi) = \mathrm{trace}((\Sigma_{\pi^0})^{-1}\Psi\Xi\Psi^T) = \mathrm{trace}(I_d) = d,$$

and similarly $\mathrm{trace}(M\Xi M\Xi) = \mathrm{trace}(I_d) = d$.
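The two trace identities can be verified numerically for any full-support null distribution and any basis satisfying (A3). The following sketch uses a randomly drawn distribution and random basis functions purely as illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 7, 3                                     # illustrative alphabet size and dimension
pi0 = rng.dirichlet(np.ones(N))                 # full-support null distribution
Psi = rng.standard_normal((d, N))               # column j holds psi(z_j)

Xi = np.diag(pi0) - np.outer(pi0, pi0)          # covariance of the indicator vector X^1
Sigma = Psi @ Xi @ Psi.T                        # Sigma_{pi0} = Psi Xi Psi^T
M = Psi.T @ np.linalg.solve(Sigma, Psi)         # M = Psi^T Sigma^{-1} Psi, as in (48)

tr1 = np.trace(M @ Xi)                          # trace(M Xi)
tr2 = np.trace(M @ Xi @ M @ Xi)                 # trace(M Xi M Xi)
```

Both traces equal $d$ up to rounding, regardless of $N$, which is the source of the dimension dependence in (32) and (33).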

We first prove that under the assumptions of Theorem III.2 (i), there is a function $r : \mathcal{P}(\mathsf{Z}) \to \mathbb{R}^d$ that is $C^1$ in a neighborhood of $\pi^0$ such that $r(\mu)$ solves (41) for $\mu$ in this neighborhood. Under the uniqueness assumption (A2), the function $r(\mu)$ coincides with the function given in (A2).

By the assumptions, we know that when $\mu = \pi^0$, (42) is satisfied by $r^*$ with $f_{r^*} = 0$. It follows that $\pi^0 = \check\pi_{r^*}$. Substituting this into (46), we obtain $\nabla_r g(\mu, r)\big|_{\mu = \pi^0,\, r = r^*} = -\Sigma_{\pi^0}$, which is negative-definite by Assumption (A3) and Lemma III.1. Therefore, by the Implicit Function Theorem, there is an open neighborhood $U$ around $\mu = \pi^0$, an open neighborhood $V$ of $r^*$, and a continuously differentiable function $r : U \to V$ that satisfies $g(\mu, r(\mu)) = 0$, for $\mu \in U$. This fact together with Assumptions (A2) and (A3) ensures that when $\mu \in U \cap B$, the vector $r(\mu)$ uniquely achieves the supremum in (41).

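Taking the total derivative of (42) with respect to $\mu(z)$ we get,

$$\frac{\partial r(\mu)}{\partial \mu(z)} = -\big[ \nabla_r g(\mu, r) \big]^{-1} \frac{\partial g(\mu, r)}{\partial \mu(z)} \Big|_{r = r(\mu)}, \qquad (49)$$

where $\partial g(\mu, r)/\partial\mu(z) = \psi^r(z)$ by (45). Consequently, when $\mu = \pi^0$,

$$\frac{\partial r(\mu)}{\partial \mu(z)} \Big|_{\mu = \pi^0} = \Sigma_{\pi^0}^{-1}\psi(z). \qquad (50)$$

These results enable us to identify the first and second order derivatives of $h(\mu) = D^{\mathrm{MM}}(\mu\|\pi^0)$. Applying $g(\mu, r(\mu)) = 0$, we obtain the derivatives of $h$ as follows,

$$\frac{\partial}{\partial \mu(z)} h(\mu) = f_{r(\mu)}(z) \qquad (51)$$

$$\frac{\partial^2}{\partial \mu(z)\,\partial \mu(\bar{z})} h(\mu) = \big( \nabla_r f_{r(\mu)}(z) \big)^T \frac{\partial r(\mu)}{\partial \mu(\bar{z})}. \qquad (52)$$

When $\mu = \pi^0$, substituting (50) in (52), we obtain (48).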

We now verify the remaining conditions required for applying Lemma III.5.

(a) It is straightforward to see that $h(\pi^0) = 0$.

(b) The function $h$ is uniformly bounded since $h(\mu) = D^{\mathrm{MM}}(\mu\|\pi^0) \le D(\mu\|\pi^0) \le \max_z \log\big(\frac{1}{\pi^0(z)}\big)$ and $\pi^0$ has full support.

(c) Since $f_{r(\mu)} = 0$ when $\mu = \pi^0$, it follows by (51) that $\frac{\partial}{\partial \mu(z)} h(\mu)\big|_{\mu = \pi^0} = 0$.

(d) Pick a compact $K \subset U \cap B$ so that $K$ contains $\pi^0$ as a relative interior point, and $K \subset \{\mu \in \mathcal{P}(\mathsf{Z}) : \max_u |\mu(u) - \pi^0(u)| < \tfrac{1}{2}\min_u |\pi^0(u)|\}$. This choice of $K$ ensures that $\lim_{n\to\infty} -\frac{1}{n}\log \mathsf{P}\{S^n \notin K\} > 0$. Note that since $r(\mu)$ is continuously differentiable on $U \cap B$, it follows by (51) and (52) that $h$ is $C^2$ on $K$.

Thus the results on convergence of the bias and variance follow from Lemma III.5.

The weak convergence result is proved using Lemma III.6 and Lemma III.7. We observe that the covariance matrix of the Gaussian vector $W$ given in Lemma III.6 is $\Sigma_W = \Xi = \mathrm{diag}(\pi^0) - \pi^0\pi^{0T}$. This does not have full rank since $\Xi\mathbf{1} = 0$, where $\mathbf{1}$ is the $N \times 1$ vector of ones. Hence we can write,

$$\Xi = GG^T$$

where $G$ is an $N \times k$ matrix for some $k < N$. In fact, since the support of $\pi^0$ is full, we have $k = N - 1$ (see Lemma III.1). Based on this representation we can write $W = GV$, where $V \sim N(0, I_k)$.

Now, by Lemma III.6 the limiting random variable is given by $U := \tfrac{1}{2}W^T M W = \tfrac{1}{2}V^T G^T M G V$, where $M = \nabla^2_\mu D^{\mathrm{MM}}(\mu\|\pi^0)\big|_{\mu = \pi^0} = \Psi^T(\Psi\Xi\Psi^T)^{-1}\Psi$. We observe that the matrix $D = G^T M G$ satisfies $D^2 = D$. Moreover, since $\Psi\Xi\Psi^T$ has rank $d$ under Assumption (A3), matrix $D$ also has rank $d$. Applying Lemma III.7 to matrix $D$, we conclude that $U \sim \tfrac{1}{2}\chi^2_d$.

2) Proof of part (ii): The conclusion of part (ii) is derived using part (i) and the decomposition in (38). We will study the bias, variance, and limiting distribution of each term in the decomposition.

For the second term, note that the dimensionality of the function class $\mathcal{G}$ is also $d$. Applying part (i) of this theorem to $D^{\mathrm{MM}}_{\mathcal{G}}(\Gamma^n\|\pi^1)$, we conclude that its asymptotic bias and variance are given by

$$\lim_{n\to\infty} \mathsf{E}[n D^{\mathrm{MM}}_{\mathcal{G}}(\Gamma^n\|\pi^1)] = \tfrac{1}{2} d, \qquad (53)$$

$$\lim_{n\to\infty} \mathrm{Var}[n D^{\mathrm{MM}}_{\mathcal{G}}(\Gamma^n\|\pi^1)] = \tfrac{1}{2} d. \qquad (54)$$

For the third term, since $Z$ is i.i.d. with marginal $\pi^1$, we have

$$\mathsf{E}[\langle \Gamma^n - \pi^1, f_{r^*} \rangle] = 0, \qquad (55)$$

$$\mathrm{Var}[n^{1/2}\langle \Gamma^n - \pi^1, f_{r^*} \rangle] = \mathrm{Cov}_{\pi^1}(f_{r^*}). \qquad (56)$$

The bias result (34) follows by combining (53), (55) and using the decomposition (38). To prove the variance result (35), we again apply the decomposition (38) to obtain,

$$\lim_{n\to\infty} \mathrm{Var}[n^{1/2} D^{\mathrm{MM}}_{\mathcal{F}}(\Gamma^n\|\pi^0)] = \lim_{n\to\infty} \Big\{ \mathrm{Var}[n^{1/2} D^{\mathrm{MM}}_{\mathcal{G}}(\Gamma^n\|\pi^1)] + \mathrm{Var}[n^{1/2}\langle \Gamma^n - \pi^1, f_{r^*} \rangle] + 2\mathsf{E}\big[ n^{1/2}\big( D^{\mathrm{MM}}_{\mathcal{G}}(\Gamma^n\|\pi^1) - \mathsf{E}[D^{\mathrm{MM}}_{\mathcal{G}}(\Gamma^n\|\pi^1)] \big)\, n^{1/2}\langle \Gamma^n - \pi^1, f_{r^*} \rangle \big] \Big\}. \qquad (57)$$

From (54) it follows that the limiting value of the first term on the right hand side of (57) is 0. The limiting value of the third term is also 0 by applying the Cauchy-Bunyakovsky-Schwarz inequality. Thus, (57) together with (56) gives (35).

Finally, we prove the weak convergence result (36) by again applying the decomposition (38). By (53) and (54), we conclude that the second term, scaled as $n^{1/2} D^{\mathrm{MM}}_{\mathcal{G}}(\Gamma^n\|\pi^1)$, converges in mean square to 0 as $n \to \infty$. The weak convergence of the third term is given in (39). Applying Slutsky's theorem, we obtain (36). ■
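Proof of Theorem III.3: The proof of this result is very similar to that of Theorem III.2 (ii) except that we use the decomposition in (37) with $\mu = \Gamma^n$. We first prove the following generalizations of (53) and (54) that characterize the asymptotic mean and variance of the second term in (37) with $\mu = \Gamma^n$:

$$\lim_{n\to\infty} \mathsf{E}[n D^{\mathrm{MM}}_{\mathcal{G}}(\Gamma^n\|\check\pi)] = \tfrac{1}{2}\mathrm{trace}(\Sigma_{\pi^1}\Sigma_{\check\pi}^{-1}) \qquad (58)$$

$$\lim_{n\to\infty} \mathrm{Var}[n D^{\mathrm{MM}}_{\mathcal{G}}(\Gamma^n\|\check\pi)] = \tfrac{1}{2}\mathrm{trace}(\Sigma_{\pi^1}\Sigma_{\check\pi}^{-1}\Sigma_{\pi^1}\Sigma_{\check\pi}^{-1}) \qquad (59)$$

where $\mathcal{G} = \mathcal{F} - f_{r^1}$, and $\check\pi$ is defined in the statement of the theorem. The argument is similar to that of Theorem III.2 (i): We denote $\bar{f}_r := f_r - f_{r^1}$, and define $h(\mu) := D^{\mathrm{MM}}_{\mathcal{G}}(\mu\|\check\pi) = \sup_{r \in \mathbb{R}^d}(\mu(\bar{f}_r) - \Lambda_{\check\pi}(\bar{f}_r))$. To apply Lemma III.5, we prove the following

$$h(\pi^1) = 0, \qquad (60)$$

$$\nabla_\mu h(\pi^1) = 0, \qquad (61)$$

$$\text{and} \quad M = \nabla^2_\mu h(\pi^1) = \Psi^T(\Sigma_{\check\pi})^{-1}\Psi. \qquad (62)$$

The last two equalities (61) and (62) are analogous to (51) and (52). We can also verify that the rest of the conditions of Lemma III.5 hold. This establishes (58) and (59).

To prove (60), first note that the supremum in the optimization problem defining $D^{\mathrm{MM}}_{\mathcal{G}}(\pi^1\|\check\pi)$ is achieved by $\bar{f}_{r^1}$, and we know by definition that $\bar{f}_{r^1} = 0$. Together with the definition $D^{\mathrm{MM}}_{\mathcal{G}}(\pi^1\|\check\pi) = \pi^1(\bar{f}_{r^1}) - \Lambda_{\check\pi}(\bar{f}_{r^1})$, we obtain (60).

Redefine $g(\mu, r) := \nabla_r(\mu(\bar{f}_r) - \Lambda_{\check\pi}(\bar{f}_r))$. The first order optimality condition of the optimization problem defining $D^{\mathrm{MM}}_{\mathcal{G}}(\mu\|\check\pi)$ gives $g(\mu, r) = 0$. The assumption that $\mathcal{F}$ is a linear function class implies that $\bar{f}_r$ is linear in $r$. Consequently $\nabla_r^2 \bar{f}_r = 0$. By the same argument that leads to (44), we can show that

$$\nabla_r g(\mu, r) = -\left[ \frac{\check\pi(e^{\bar{f}_r}\nabla_r \bar{f}_r \nabla_r \bar{f}_r^T)}{\check\pi(e^{\bar{f}_r})} - \frac{\check\pi(e^{\bar{f}_r}\nabla_r \bar{f}_r)\,\check\pi(e^{\bar{f}_r}\nabla_r \bar{f}_r^T)}{(\check\pi(e^{\bar{f}_r}))^2} \right] \qquad (63)$$

Together with the fact that $\bar{f}_{r^1} = 0$ and $\nabla_r \bar{f}_r = \nabla_r f_r$, we obtain

$$\nabla_r g(\mu, r)\Big|_{\mu = \pi^1,\, r = r^1} = -\Sigma_{\check\pi}. \qquad (64)$$

Proceeding as in the proof of Theorem III.2 (i), we obtain (61) and (62).

Now using similar steps as in the proof of Theorem III.2 (ii), and noticing that $\log(\check\pi/\pi^0) = f_{r^1}$ up to an additive constant (which does not affect the inner product with $\Gamma^n - \pi^1$), we can establish the following results on the third term of (37):

$$\mathsf{E}\Big[\Big\langle \Gamma^n - \pi^1, \log\frac{\check\pi}{\pi^0} \Big\rangle\Big] = 0$$

$$\mathrm{Var}\Big[n^{1/2}\Big\langle \Gamma^n - \pi^1, \log\frac{\check\pi}{\pi^0} \Big\rangle\Big] = \mathrm{Cov}_{\pi^1}(f_{r^1})$$

$$n^{1/2}\Big\langle \Gamma^n - \pi^1, \log\frac{\check\pi}{\pi^0} \Big\rangle \xrightarrow[n\to\infty]{d.} N(0, \mathrm{Cov}_{\pi^1}(f_{r^1})).$$

Continuing the same arguments as in Theorem III.2 (i), we obtain the result of Theorem III.3. ■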



A. Interpretation of the asymptotic results and performance comparison 

The asymptotic results established above can be used to study the finite sample performance of the 
mismatched test and Hoeffding test. Recall that in the discussion surrounding Figure Q] we concluded 
that the approximation obtained from a Central Limit Theorem gives much better estimates of error 
probabilities as compared to those suggested by Sanov's theorem. 




Fig. 4. Comparison of ROCs of Hoeffding and mismatched tests (horizontal axis: probability of false alarm $p_{FA}$).



Suppose the log-likelihood ratio function $\log(\pi^1/\pi^0)$ lies in the function class $\mathcal{F}$. In this case, the results of Theorem III.2 and Lemma III.4 are informally summarized in the following approximations: With $\Gamma^n$ denoting the empirical distributions of the i.i.d. process $Z$,

$$D^{\mathrm{MM}}(\Gamma^n\|\pi^0) \approx \begin{cases} \dfrac{1}{2n}\sum_{k=1}^{d} W_k^2, & Z_i \sim \pi^0 \\[2ex] D(\pi^1\|\pi^0) + \dfrac{1}{2n}\sum_{k=1}^{d} W_k^2 + \dfrac{1}{\sqrt{n}}\sigma_1 U, & Z_i \sim \pi^1 \end{cases} \qquad (65)$$

where $\{W_k\}$ is i.i.d., $N(0, 1)$, and $U$ is also $N(0, 1)$ but not independent of the $W_k$'s. The standard deviation $\sigma_1$ is given in Theorem III.2. These distributional approximations are valid for large $n$, and are subject to assumptions on the function class used in the theorem.

We observe from (65) that, for large enough $n$, when the observations are drawn under $\pi^0$, the mismatched divergence is well approximated by $\frac{1}{2n}$ times a chi-squared random variable with $d$ degrees of freedom. We also observe that when the observations are drawn under $\pi^1$, the mismatched divergence is well approximated by a Gaussian random variable with mean $D(\pi^1\|\pi^0)$ and with a variance proportional to $\frac{1}{n}$ and independent of $d$. Since the mismatched test can be interpreted as a GLRT, these results capture the rate of degradation of the finite sample performance of a GLRT as the dimensionality of the parameterized family of alternate hypotheses increases. We corroborate this intuitive reasoning through Monte Carlo simulation experiments.

We estimated via simulation the performance of the Hoeffding test and mismatched tests designed using a linear function class. We compared the error probabilities of these tests for an alphabet size of $N = 19$ and sequence length of $n = 40$. We chose $\pi^0$ to be the uniform distribution, and $\pi^1$ to be the distribution obtained by convolving two uniform distributions on sets of size $(N + 1)/2$. We chose the basis function $\psi_1$ appearing in (26) to be the log-likelihood ratio between $\pi^1$ and $\pi^0$, viz.,

$$\psi_1(z_i) = \log\frac{\pi^1(z_i)}{\pi^0(z_i)}, \quad 1 \le i \le N$$

and the other basis functions $\psi_2, \psi_3, \ldots, \psi_d$ were chosen uniformly at random. Figure 4 shows a comparison of the ROCs of the Hoeffding test and mismatched tests for different values of dimension $d$. Plotted on the x-axis is the probability of false alarm, i.e., the probability of misclassification under $\pi^0$; shown on the y-axis is the probability of detection, i.e., the probability of correct classification under $\pi^1$.
The various points on each ROC curve are obtained by varying the threshold r] used in the Hoeffding 
test of (0]) and mismatched test of ( fT9i 
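The distributions and basis functions of this experiment can be set up in a few lines. The sketch below is a partial reconstruction under assumptions: the dimension `d = 5` is an arbitrary choice, and "chosen uniformly at random" is interpreted here as independent draws from the uniform distribution on $[0, 1)$, which the text does not specify.

```python
import numpy as np

N, n, d = 19, 40, 5                              # N and n from the text; d is illustrative
rng = np.random.default_rng(4)

pi0 = np.full(N, 1.0 / N)                        # uniform null distribution
u = np.full((N + 1) // 2, 2.0 / (N + 1))         # uniform on a set of size (N + 1) / 2
pi1 = np.convolve(u, u)                          # triangular alternate distribution on N points

psi = np.empty((d, N))
psi[0] = np.log(pi1 / pi0)                       # first basis function: log-likelihood ratio
psi[1:] = rng.uniform(size=(d - 1, N))           # remaining basis functions drawn at random
```

The convolution of two uniform distributions on ten points yields a triangular distribution on exactly $N = 19$ points with full support, so the log-likelihood ratio is finite everywhere.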

From Figure 4 we see that as $d$ increases the performance of the mismatched tests degrades. This is consistent with the approximation (65) which suggests that the variance of the mismatched divergence increases with $d$. Furthermore, as we saw earlier, the Hoeffding test can be interpreted as a special case of the mismatched test for a specific choice of the function class with $d = N - 1$ and hence the performance of the mismatched test matches the performance of the Hoeffding test when $d = N - 1$.

To summarize, the above results suggest that although the Hoeffding test is optimal in an error-exponent 
sense, it is disadvantageous in terms of finite sample error probabilities to blindly use the Hoeffding test 
if it is known a priori that the alternate distribution belongs to some parameterized family of distributions. 

IV. Conclusions 

The mismatched test provides a solution to the universal hypothesis testing problem that can incorporate
prior knowledge in order to reduce variance. The main results of Section III show that the variance
reduction over Hoeffding's optimal test is substantial when the state space is large.

The dimensionality of the function class can be chosen by the designer to ensure that the bias and
variance are within tolerable limits. It is in this phase of design that prior knowledge is required to ensure
that the error-exponent remains sufficiently large under the alternate hypothesis (see e.g. Corollary II.1).
In this way the designer can make effective tradeoffs between the power of the test and the variance of
the test statistic.

The mismatched divergence provides a unification of several approaches to robust and universal 
hypothesis testing. Although constructed in an i.i.d. setting, the mismatched tests are applicable in very 
general settings, and the performance analysis presented here is easily generalized to any stationary 
process satisfying the Central Limit Theorem. 

There are many directions for future research. Topics of current research include, 

(i) Algorithms for basis synthesis and basis adaptation. 

(ii) Extensions to Markovian models. 

(iii) Extensions to change detection. 

Initial progress in basis synthesis is reported in [22]. Recent results addressing the computational complexity of the mismatched test are reported in [20]. Although the exact computation of the mismatched divergence requires the solution of an optimization problem, we describe a computationally tractable approximation in [20]. We are also actively pursuing applications to problems surrounding building energy and surveillance. Some initial progress is reported in [23].

Appendix

A. Excess codelength for source coding with training 

The results in Theorem III.2 give us the asymptotic behavior of $D(\Gamma^n\|\pi)$ but what we need here is
the behavior of $D(\pi\|\Gamma^n)$. Define

£>(7T||/i) if M G P e/2 
£>(7r||7r 1 ') else 



It is clear that h is uniformly bounded from above by log -. Although h is not continuous at the boundary 
of $\mathcal{P}_{\epsilon/2}$, a modified version of Lemmas III.5 and III.6 can be applied to the function $h$ to establish the results of (16) following the same steps used in proving Theorem III.2. The Hessian matrix $M$ appearing in the statement of the lemmas is given by,

$$M = \nabla^2 h(\pi) = \mathrm{diag}(\pi)^{-1}.$$

Hence, $\mathrm{trace}(M\Xi) = \mathrm{trace}(M\Xi M\Xi) = N - 1$.
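The trace computation can be confirmed numerically, taking $\Xi = \mathrm{diag}(\pi) - \pi\pi^T$ as in the proof of Theorem III.2. The distribution below is randomly generated and serves only as an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 9                                          # illustrative alphabet size
pi = rng.dirichlet(np.ones(N))                 # a distribution with full support

Xi = np.diag(pi) - np.outer(pi, pi)            # covariance of the indicator vector
M = np.diag(1.0 / pi)                          # Hessian of mu -> D(pi || mu) at mu = pi

tr1 = np.trace(M @ Xi)                         # trace(M Xi)
tr2 = np.trace(M @ Xi @ M @ Xi)                # trace(M Xi M Xi)
```

Here $M\Xi = I - \mathbf{1}\pi^T$ is idempotent, which is why both traces equal $N - 1$.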

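B. Proof of Lemma III.4

Proof: In the following chain of identities, the first, third and fifth equalities follow from relation (25) and Proposition II.5.

$$\begin{aligned}
D^{\mathrm{MM}}_{\mathcal{F}}(\mu\|\pi^0) &= D(\mu\|\pi^0) - \inf\{ D(\mu\|\nu) : \nu = \pi^0\exp(f - \Lambda_{\pi^0}(f)),\ f \in \mathcal{F} \} \\
&= D(\mu\|\check\pi) + \Big\langle \mu, \log\frac{\check\pi}{\pi^0} \Big\rangle - \inf\{ D(\mu\|\nu) : \nu = \check\pi\exp(f - \Lambda_{\check\pi}(f)),\ f \in \mathcal{G} \} \\
&= D^{\mathrm{MM}}_{\mathcal{G}}(\mu\|\check\pi) + \Big\langle \mu, \log\frac{\check\pi}{\pi^0} \Big\rangle \\
&= D^{\mathrm{MM}}_{\mathcal{G}}(\mu\|\check\pi) + \Big\langle \mu - \pi^1, \log\frac{\check\pi}{\pi^0} \Big\rangle + D(\pi^1\|\pi^0) - D(\pi^1\|\check\pi) \\
&= D^{\mathrm{MM}}_{\mathcal{G}}(\mu\|\check\pi) + \Big\langle \mu - \pi^1, \log\frac{\check\pi}{\pi^0} \Big\rangle + D^{\mathrm{MM}}_{\mathcal{F}}(\pi^1\|\pi^0).
\end{aligned}$$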



C. Proof of Lemma III.5

The following simple lemma will be used in multiple places in the proof that follows.

Lemma A.1. If a sequence of random variables $\{A^n\}$ satisfies $\mathsf{E}[A^n] \to a$ as $n \to \infty$ and $\{\mathsf{E}[(A^n)^2]\}$ is a bounded sequence, and another sequence of random variables $\{B^n\}$ satisfies $B^n \to b$ in mean square, then $\mathsf{E}[A^n B^n] \to ab$ as $n \to \infty$. □

Proof of Lemma III.5: Without loss of generality, we can assume that the mean $\bar{x}$ is the origin in $\mathbb{R}^m$ and that $h(\bar{x}) = 0$.

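Since the Hessian is continuous over the set $K$, we have by Taylor's theorem:

$$n(h(S^n) - \nabla h(\bar{x})^T S^n)\,\mathbb{I}\{S^n \in K\} = n\big[ h(\bar{x}) + \tfrac{1}{2} S^{nT}\nabla^2 h(\tilde{S}^n) S^n \big]\,\mathbb{I}\{S^n \in K\} \qquad (66)$$

$$= \tfrac{n}{2} S^{nT}\nabla^2 h(\tilde{S}^n) S^n\,\mathbb{I}\{S^n \in K\} \qquad (67)$$

where $\tilde{S}^n = \gamma S^n$ for some $\gamma = \gamma(n) \in [0, 1]$. By the strong law of large numbers we have $S^n \to \bar{x}$ almost surely as $n \to \infty$. Hence $\tilde{S}^n \to \bar{x}$ and $\nabla^2 h(\tilde{S}^n) \to \nabla^2 h(\bar{x}) = M$ almost surely since $\nabla^2 h$ is continuous at $\bar{x}$. Now by the boundedness of the second derivative over $K$ and the fact that

$$\mathbb{I}\{S^n \in K\} \xrightarrow[n\to\infty]{a.s.} 1$$

we have $(\nabla^2 h(\tilde{S}^n))_{i,j}\,\mathbb{I}\{S^n \in K\} \to M_{i,j}$ in mean square.

Under the assumption that $X$ is i.i.d. on the compact set $\mathsf{X}$, we have

$$\mathsf{E}[n S^n_i S^n_j] = \Xi_{i,j} \quad \text{for all } n,$$

and $\mathsf{E}[(n S^n_i S^n_j)^2]$ converges to a finite quantity as $n \to \infty$. Hence the results of Lemma A.1 are applicable with $A^n = n S^n_i S^n_j$ and $B^n = (\nabla^2 h(\tilde{S}^n))_{i,j}\,\mathbb{I}\{S^n \in K\}$, which gives:

$$\mathsf{E}[n S^n_i S^n_j (\nabla^2 h(\tilde{S}^n))_{i,j}\,\mathbb{I}\{S^n \in K\}] \xrightarrow[n\to\infty]{} \Xi_{i,j} M_{i,j}. \qquad (68)$$

Thus we have

$$\mathsf{E}[n(h(S^n) - \nabla h(\bar{x})^T S^n)\,\mathbb{I}\{S^n \in K\}] = \mathsf{E}\big[\tfrac{n}{2} S^{nT}\nabla^2 h(\tilde{S}^n) S^n\,\mathbb{I}\{S^n \in K\}\big] \xrightarrow[n\to\infty]{} \tfrac{1}{2}\mathrm{trace}(M\Xi). \qquad (69)$$

Since $\mathsf{X}$ is compact, $h$ is continuous, and $h$ is differentiable at $\bar{x}$, it follows that there are scalars $\bar{h}$ and $\bar{s}$ such that $\sup_{x \in \mathsf{X}} |h(x)| < \bar{h}$ and $|\nabla h(\bar{x})^T S^n| < \bar{s}$. Hence,

$$\big| \mathsf{E}[n(h(S^n) - \nabla h(\bar{x})^T S^n)\,\mathbb{I}\{S^n \notin K\}] \big| \le n(\bar{h} + \bar{s})\,\mathsf{P}\{S^n \notin K\} \xrightarrow[n\to\infty]{} 0 \qquad (70)$$

where we use the assumption that $\mathsf{P}\{S^n \notin K\}$ decays exponentially in $n$. Combining (69) and (70) and using the fact that $S^n$ has zero mean, we have

$$\mathsf{E}[n h(S^n)] = \mathsf{E}[n(h(S^n) - \nabla h(\bar{x})^T S^n)] \xrightarrow[n\to\infty]{} \tfrac{1}{2}\mathrm{trace}(M\Xi).$$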

This establishes the result of (i). 
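
The limit in (i) can be checked numerically in a scalar case where the expectation is computable exactly. The sketch below is an illustration under assumptions that are not in the original: the samples are centered Bernoulli variables (value $1 - p$ with probability $p$, value $-p$ otherwise), so the covariance is $p(1-p)$, and the test function is $h(x) = \cosh(2x) - 1$, for which $h(0) = 0$, $h'(0) = 0$ and $M = h''(0) = 4$. The helper names `binom_pmf` and `scaled_moments` are hypothetical. The same routine also returns the variance of $n h(S^n)$, which can be compared with the limit $\tfrac{1}{2}\mathrm{trace}(M \Xi M \Xi)$ in part (ii) of the lemma.

```python
import math


def binom_pmf(n, k, p):
    """Binomial(n, p) probability of k successes, computed in log space."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)


def scaled_moments(n, p, h):
    """Exact mean and variance of n*h(S^n) for centered Bernoulli(p) samples."""
    first = 0.0
    second = 0.0
    for k in range(n + 1):
        s = k / n - p            # sample mean when k of the n samples equal 1 - p
        weight = binom_pmf(n, k, p)
        y = n * h(s)
        first += weight * y
        second += weight * y * y
    return first, second - first * first


p = 0.3                          # illustrative choice
M = 4.0                          # second derivative of h at the mean
Xi = p * (1 - p)                 # variance of one centered sample


def h(x):
    return math.cosh(2 * x) - 1


for n in (10, 100, 1000, 10000):
    bias, var = scaled_moments(n, p, h)
    print(n, bias, var)
print("limits:", 0.5 * M * Xi, 0.5 * (M * Xi) ** 2)
```

Because every sample mean lies in $[-1, 1]$, any compact $K$ containing that interval makes $P\{S^n \notin K\} = 0$, so the exponential-decay assumption holds trivially in this example.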
Under the condition that the directional derivative is zero, (67) can be written as

$$n h(S^n) I_{\{S^n \in K\}} = \tfrac{n}{2} S^{nT} \nabla^2 h(\tilde{S}^n) S^n I_{\{S^n \in K\}}. \tag{71}$$

Now by squaring (71), we have

$$\bigl(n h(S^n) I_{\{S^n \in K\}}\bigr)^2 = \frac{n^2}{4} \sum_{i,j,k,\ell} S^n_i \nabla^2 h(\tilde{S}^n)_{i,j} S^n_j\, S^n_k \nabla^2 h(\tilde{S}^n)_{k,\ell} S^n_\ell\, I_{\{S^n \in K\}}.$$

As before, by the boundedness of the Hessian we have:

$$(\nabla^2 h(\tilde{S}^n))_{i,j} (\nabla^2 h(\tilde{S}^n))_{k,\ell} I_{\{S^n \in K\}} \xrightarrow[n \to \infty]{\text{m.s.}} M_{i,j} M_{k,\ell}.$$



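It can also be shown that

$$E[n^2 S^n_i S^n_j S^n_k S^n_\ell] = \frac{1}{n} F_{i,j,k,\ell} + \frac{n-1}{n}\bigl(\Xi_{i,j}\Xi_{k,\ell} + \Xi_{j,k}\Xi_{i,\ell} + \Xi_{i,k}\Xi_{j,\ell}\bigr) \quad \text{for all } n$$

where $F_{i,j,k,\ell} = E[X^1_i X^1_j X^1_k X^1_\ell]$. Moreover, $E[(n^2 S^n_i S^n_j S^n_k S^n_\ell)^2]$ is finite for each $n$ and converges to a finite quantity as $n \to \infty$ since the moments of $X^1$ are finite. Thus we can again apply Lemma A.1 to see that

$$E\bigl[n^2 S^n_i \nabla^2 h(\tilde{S}^n)_{i,j} S^n_j\, S^n_k \nabla^2 h(\tilde{S}^n)_{k,\ell} S^n_\ell\, I_{\{S^n \in K\}}\bigr] \xrightarrow[n \to \infty]{} \bigl(\Xi_{i,j}\Xi_{k,\ell} + \Xi_{j,k}\Xi_{i,\ell} + \Xi_{i,k}\Xi_{j,\ell}\bigr) M_{i,j} M_{k,\ell}.$$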
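
The fourth-moment identity for the sample mean of zero-mean i.i.d. vectors can be verified by brute force on a small example. The sketch below assumes a hypothetical three-point distribution on $\mathbb{R}^2$ chosen to have zero mean, and uses exact rational arithmetic so that the comparison is an equality, not an approximation. It checks the form $E[n^2 S_i S_j S_k S_\ell] = F_{i,j,k,\ell}/n + \tfrac{n-1}{n}(\Xi_{i,j}\Xi_{k,\ell} + \Xi_{i,k}\Xi_{j,\ell} + \Xi_{i,\ell}\Xi_{j,k})$, where $\Xi$ is the covariance of one sample and $F$ its fourth-moment tensor. The names `moment`, `lhs` and `rhs` are illustrative.

```python
from fractions import Fraction
from itertools import product

# Hypothetical zero-mean distribution on three points of R^2.
points = [(Fraction(2), Fraction(0)),
          (Fraction(0), Fraction(1)),
          (Fraction(-1), Fraction(-1))]
probs = [Fraction(1, 5), Fraction(2, 5), Fraction(2, 5)]


def moment(indices):
    """E[product of X_a over a in indices] for a single sample X."""
    total = Fraction(0)
    for x, p in zip(points, probs):
        term = p
        for a in indices:
            term *= x[a]
        total += term
    return total


def lhs(n, i, j, k, l):
    """Exact E[n^2 S_i S_j S_k S_l], enumerating every n-tuple of samples."""
    total = Fraction(0)
    for combo in product(range(len(points)), repeat=n):
        prob = Fraction(1)
        s = [Fraction(0), Fraction(0)]
        for c in combo:
            prob *= probs[c]
            s[0] += points[c][0]
            s[1] += points[c][1]
        s = [v / n for v in s]
        total += prob * n ** 2 * s[i] * s[j] * s[k] * s[l]
    return total


def rhs(n, i, j, k, l):
    """Closed form: F/n plus (n-1)/n times the three covariance pairings."""
    pairs = (moment((i, j)) * moment((k, l))
             + moment((i, k)) * moment((j, l))
             + moment((i, l)) * moment((j, k)))
    return moment((i, j, k, l)) / n + Fraction(n - 1, n) * pairs


n = 4
assert moment((0,)) == 0 and moment((1,)) == 0   # the distribution has zero mean
assert all(lhs(n, *idx) == rhs(n, *idx) for idx in product(range(2), repeat=4))
```

As $n$ grows the $F_{i,j,k,\ell}/n$ term vanishes and the factor $(n-1)/n$ tends to one, which is why only the three covariance pairings survive in the limit.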



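Putting together terms and using (71) we obtain:

$$E\bigl[(n h(S^n))^2 I_{\{S^n \in K\}}\bigr] \xrightarrow[n \to \infty]{} \tfrac{1}{2}\,\mathrm{trace}(M \Xi M \Xi) + \tfrac{1}{4}\bigl(\mathrm{trace}(M\Xi)\bigr)^2.$$

Now similar to (70) we have:

$$\bigl|E\bigl[(n h(S^n))^2 I_{\{S^n \notin K\}}\bigr]\bigr| \le n^2 \bar{h}^2 P\{S^n \notin K\} \xrightarrow[n \to \infty]{} 0. \tag{72}$$

Consequently

$$E\bigl[(n h(S^n))^2\bigr] \xrightarrow[n \to \infty]{} \tfrac{1}{2}\,\mathrm{trace}(M \Xi M \Xi) + \tfrac{1}{4}\bigl(\mathrm{trace}(M\Xi)\bigr)^2$$

which gives (ii).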
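
For readers who want the bookkeeping behind the last two steps, the following worked computation is a sketch that uses only the symmetry of $M$ and of the covariance matrix, written $\Xi$ here. It shows how the sum over indices collapses to traces, and how the second-moment limit combines with the bias limit from (i) to give the variance limit in (ii):

```latex
\begin{align*}
\sum_{i,j,k,\ell} \Xi_{i,j}\Xi_{k,\ell} M_{i,j} M_{k,\ell}
  &= \Bigl(\sum_{i,j} M_{i,j}\Xi_{j,i}\Bigr)^2 = \bigl(\mathrm{trace}(M\Xi)\bigr)^2, \\
\sum_{i,j,k,\ell} \Xi_{j,k}\Xi_{i,\ell} M_{i,j} M_{k,\ell}
  &= \sum_{i,j,k,\ell} M_{i,j}\Xi_{j,k} M_{k,\ell}\Xi_{\ell,i} = \mathrm{trace}(M\Xi M\Xi), \\
\sum_{i,j,k,\ell} \Xi_{i,k}\Xi_{j,\ell} M_{i,j} M_{k,\ell}
  &= \sum_{i,j,k,\ell} M_{j,i}\Xi_{i,k} M_{k,\ell}\Xi_{\ell,j} = \mathrm{trace}(M\Xi M\Xi), \\
\lim_{n\to\infty} \mathrm{Var}[n h(S^n)]
  &= \tfrac{1}{2}\mathrm{trace}(M\Xi M\Xi) + \tfrac{1}{4}\bigl(\mathrm{trace}(M\Xi)\bigr)^2
     - \bigl(\tfrac{1}{2}\mathrm{trace}(M\Xi)\bigr)^2
   = \tfrac{1}{2}\mathrm{trace}(M\Xi M\Xi).
\end{align*}
```

The first three lines, multiplied by the factor $\tfrac{1}{4}$ from squaring (71), account for the two terms in the second-moment limit.
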
D. Proof of Lemma III.6

We know that $T^n$ can be written as an empirical average of i.i.d. vectors. Hence, it satisfies the central limit theorem, which says that

$$\sqrt{n}\,(T^n - \mu) \xrightarrow[n \to \infty]{d} W \tag{73}$$

where the distribution of $W$ is defined below (40).

Considering a second-order Taylor expansion and using the condition on the directional derivative, we have

$$n\bigl(h(T^n) - h(\mu)\bigr) = \tfrac{1}{2}\, n\, (T^n - \mu)^T \nabla^2 h(\tilde{T}^n) (T^n - \mu)$$

where $\tilde{T}^n = \gamma T^n + (1 - \gamma)\mu$ for some $\gamma = \gamma(n) \in [0, 1]$. We also know by the strong law of large numbers that $T^n$ and hence $\tilde{T}^n$ converge to $\mu$ almost surely. By the continuity of the Hessian, we have

$$\nabla^2 h(\tilde{T}^n) \xrightarrow[n \to \infty]{\text{a.s.}} \nabla^2 h(\mu). \tag{74}$$

By applying the vector version of Slutsky's theorem [24], together with (73) and (74), we conclude

$$\tfrac{n}{2}\,(T^n - \mu)^T \nabla^2 h(\tilde{T}^n) (T^n - \mu) \xrightarrow[n \to \infty]{d} \tfrac{1}{2}\, W^T \nabla^2 h(\mu) W,$$

thus establishing the lemma.
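
A concrete instance of this limit, given here only as an illustration under stated assumptions, takes a binary alphabet with $\mu = (1 - p, p)$ and $h(\nu) = D(\nu \| \mu)$, the K-L divergence from the empirical distribution to $\mu$. For this $h$ the directional derivative at $\mu$ vanishes on differences of probability vectors, and a short calculation gives $\tfrac{1}{2} W^T \nabla^2 h(\mu) W$ distributed as one half of a chi-squared variable with one degree of freedom, whose distribution function at $t$ is $\mathrm{erf}(\sqrt{t})$. The sketch computes the exact distribution function of $n D(\Gamma^n \| \mu)$ from the binomial law and compares it with that limit; the function names are hypothetical.

```python
import math


def binom_pmf(n, k, p):
    """Binomial(n, p) probability of k successes, computed in log space."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)


def kl_bernoulli(q, p):
    """D(nu || mu) for nu = (1-q, q) and mu = (1-p, p), with 0*log(0) = 0."""
    total = 0.0
    for a, b in ((q, p), (1 - q, 1 - p)):
        if a > 0:
            total += a * math.log(a / b)
    return total


def cdf_scaled_divergence(n, p, t):
    """Exact P{ n * D(empirical || mu) <= t } for n i.i.d. Bernoulli(p) samples."""
    return sum(binom_pmf(n, k, p)
               for k in range(n + 1)
               if n * kl_bernoulli(k / n, p) <= t)


p, t = 0.3, 0.5                  # illustrative choices
for n in (100, 1000, 20000):
    print(n, cdf_scaled_divergence(n, p, t))
print("limit:", math.erf(math.sqrt(t)))
```

The agreement is only approximate at finite $n$ because the empirical distribution lives on a lattice, so the exact distribution function moves in jumps of order $1/\sqrt{n}$.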

E. Proof of Lemma III.7

Proof: The assumption that $D$ is a projection matrix implies that $D^2 = D$. Let $\{u^1, \ldots, u^m\}$ denote an orthonormal basis, chosen so that the first $K$ vectors span the range space of $D$. Hence $D u^i = u^i$ for $1 \le i \le K$, and $D u^i = 0$ for all other $i$.

Let $U$ denote the unitary matrix whose $m$ columns are $\{u^1, \ldots, u^m\}$. Then $\tilde{V} = UV$ is also an $N(0, I_m)$ random variable, and hence $DV$ and $D\tilde{V}$ have the same Gaussian distribution.

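To complete the proof we demonstrate that $\|D\tilde{V}\|^2$ has a chi-squared distribution: By construction the vector $\tilde{Y} = D\tilde{V}$ has components, with respect to the basis $\{u^i\}$, given by

$$\tilde{Y}_i = \begin{cases} V_i & 1 \le i \le K \\ 0 & K < i \le m \end{cases}$$

It follows that $\|\tilde{Y}\|^2 = \|D\tilde{V}\|^2 = V_1^2 + \cdots + V_K^2$ has a chi-squared distribution with $K$ degrees of freedom. ■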
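
The construction in this proof can be traced numerically. The sketch below assumes a hypothetical orthonormal basis of $\mathbb{R}^3$ and builds the orthogonal projection $D$ onto the span of its first $K = 2$ vectors. It checks that $D^2 = D$, that $\|D U v\|^2$ equals $v_1^2 + \cdots + v_K^2$ for every sampled $v$, and that the sample mean of $\|D U V\|^2$ over standard Gaussian draws is close to $K$, the mean of a chi-squared variable with $K$ degrees of freedom. All names are illustrative.

```python
import math
import random


def matvec(A, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def matmul(A, B):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


# Hypothetical orthonormal basis u^1, u^2, u^3 of R^3.
basis = [
    [1 / math.sqrt(3), 1 / math.sqrt(3), 1 / math.sqrt(3)],
    [1 / math.sqrt(2), -1 / math.sqrt(2), 0.0],
    [1 / math.sqrt(6), 1 / math.sqrt(6), -2 / math.sqrt(6)],
]
m, K = 3, 2

# D projects onto span{u^1, ..., u^K}: D = sum over i <= K of u^i (u^i)^T.
D = [[sum(u[r] * u[c] for u in basis[:K]) for c in range(m)] for r in range(m)]
# U has the basis vectors as its columns.
U = [[basis[c][r] for c in range(m)] for r in range(m)]

# Projection property D^2 = D, up to rounding.
D2 = matmul(D, D)
assert all(abs(D2[r][c] - D[r][c]) < 1e-12 for r in range(m) for c in range(m))


def projected_norm_sq(v):
    """Squared norm of D U v."""
    y = matvec(D, matvec(U, v))
    return sum(c * c for c in y)


rng = random.Random(0)
samples = []
for _ in range(20000):
    v = [rng.gauss(0.0, 1.0) for _ in range(m)]
    q = projected_norm_sq(v)
    # Only the first K coordinates of v contribute.
    assert abs(q - sum(c * c for c in v[:K])) < 1e-9
    samples.append(q)

mean = sum(samples) / len(samples)
print("sample mean:", mean, "degrees of freedom:", K)
```

The per-sample identity is deterministic and holds up to rounding, whereas the comparison of the sample mean with $K$ is a Monte Carlo check and is subject to sampling error.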